Automatic Tuning of Whole Applications:

A LACSI Symposium Workshop

Overview and Abstracts

October 12, 2005

Workshop Coordinator

Ken Kennedy, ken@rice.edu

Workshop Abstract

For many years, retargeting of applications for new architectures has been a major headache for high performance computing. As new architectures have emerged at dizzying speed, we have moved from uniprocessors to vector machines, symmetric multiprocessors, synchronous parallel arrays, distributed-memory parallel computers, and scalable clusters. Each new architecture, and even each new model of a given architecture, has required retargeting and retuning every application, often at the cost of many person-months or years of effort.

    Recently a number of strategies have emerged for automating the process of tuning applications to new architectures, based on using large amounts of computation time to explore a space of different variants of the program, running each variant on the target architecture. One example of this strategy is the ATLAS system, which uses substantial amounts of computation to produce versions of computational linear algebra kernels that are highly tuned in advance for different machines. If this approach can be extended more generally to components and whole programs, it would help avoid the enormous human costs involved in retargeting applications to different machines. A major research question in this area remains: given that the space of variants can be enormous, how can we reduce tuning time to manageable levels?
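
To make the strategy concrete, here is a minimal sketch, in Python, of the generate-and-measure loop such systems are built around; the variant list and input data are hypothetical stand-ins for real candidate implementations of a kernel:

    import time

    def time_variant(variant, data, repeats=5):
        # Take the best of several runs to damp measurement noise.
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            variant(data)
            best = min(best, time.perf_counter() - t0)
        return best

    def autotune(variants, data):
        # Measure every candidate on the target machine; keep the fastest.
        return min(variants, key=lambda v: time_variant(v, data))

    # Hypothetical example: two ways to sum a list.
    best = autotune([sum, lambda xs: sum(x for x in xs)], list(range(100000)))

The research question above is visible even in this toy: the loop times every variant, which is exactly what becomes intractable when the variant space is large.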

    Recently a number of research groups have been pursuing work in this area.  We propose to hold a workshop at the 2005 LACSI Symposium to report on these efforts and to solicit feedback from the application development community.  The goals of this workshop will be:

  1. To provide a forum for research groups working on automatic tuning to report on the status of their efforts and their future plans;

  2. To foster collaborations among different autotuning research groups and the potential users of tools that they may develop; and

  3. To initiate an activity to develop a standard set of benchmarks for use in this research.

Presentation Abstracts

Application Performance Tuning Activities in LANL X Division

Hank Alme (LANL X-8)

Slides: PDF

A new team has formed in LANL's X Division, charged with bringing the tools and methods available to bear on improving the performance of the main LANL weapons simulation codes. We will present an overview of the planned team activities, with an emphasis on the areas where we hope to be able to interact with the performance tuning community outside the lab.

Combining Models and Guided Empirical Search for Memory Hierarchy Optimization

Chun Chen, Jacqueline Chame, and Mary Hall (USC ISI)

Slides: PDF

This talk will describe an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths: the models and heuristics limit the search to a small number of candidate implementations, and the empirical results give the compiler the most accurate information for selecting among candidates and tuning optimization parameter values. We will present performance results and discuss future directions. Notably, our results for matrix multiply achieve performance comparable to ATLAS and the vendor BLAS libraries.
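
As a rough illustration of this division of labor (a sketch, not the talk's actual algorithm), the fragment below uses a simple cache-capacity model to prune tile sizes for a blocked matrix multiply before any timing is done; the cache size, element size, and the run_blocked routine are assumptions for illustration:

    import itertools
    import time

    CACHE_BYTES = 32 * 1024   # assumed L1 data cache capacity
    ELEM_BYTES = 8            # double precision

    def model_admits(ti, tj, tk):
        # Capacity heuristic: the three tiles a blocked matrix multiply
        # touches must fit in cache together.
        return (ti * tk + tk * tj + ti * tj) * ELEM_BYTES <= CACHE_BYTES

    # The model cuts the candidate set down before any empirical runs.
    candidates = [t for t in itertools.product([8, 16, 32, 64], repeat=3)
                  if model_admits(*t)]

    def tune(run_blocked, n):
        # Empirically time only the model-approved candidates and keep
        # the fastest (run_blocked is a hypothetical blocked kernel).
        def measure(tiles):
            t0 = time.perf_counter()
            run_blocked(n, *tiles)
            return time.perf_counter() - t0
        return min(candidates, key=measure)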

Generic Code Optimization

Shirley Moore, Jack Dongarra, Keith Seymour, and Haihang You (University of Tennessee)

Slides: PDF or PPT

Generic Code Optimization is a research effort to develop a tool that will allow a critical software segment to be analyzed and empirically optimized in a way similar to how ATLAS performs its optimization. It is a collaboration with the ROSE project at Lawrence Livermore National Laboratory, which involves source-to-source code transformation and optimization. The approach to optimizing arbitrary code, especially loop nests, includes the following components (a toy sketch of the first component appears after the list):

  1. machine parameter detection

  2. source-to-source code generation

  3. test driver generation

  4. an empirical search engine
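
As one illustration of the first component, machine parameters can be probed empirically rather than read from documentation. The toy sketch below times strided sweeps over arrays of growing size; a jump in time per access suggests the working set has fallen out of some cache level. This is a deliberately crude illustration, not the project's detector (real detectors, such as ATLAS's, are written in C and are far more careful):

    import array
    import time

    def probe_cache(stride=16, passes=4):
        # Time per element access as the working set grows; a sharp
        # rise suggests the array no longer fits in some cache level.
        results = {}
        for kb in (4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
            n = kb * 1024 // 8
            a = array.array("d", range(n))
            accesses = passes * (n // stride)
            t0 = time.perf_counter()
            s = 0.0
            for _ in range(passes):
                for i in range(0, n, stride):
                    s += a[i]
            results[kb] = (time.perf_counter() - t0) / accesses
        return results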

Research Directions in Automatic Tuning of Libraries and Applications

David Padua (University of Illinois at Urbana-Champaign)

Slides: PDF

I will discuss a few open problems in automatic tuning as well as our efforts at Illinois to address them. Topics include:

    1. Infrastructures to support the development of self-tuning code. We have outlined an infrastructure built around language extensions that specify code generation and transformation as well as the search strategy.

    2. Search strategies to identify the best version among the astronomical number of possibilities that are usually available. We have studied search strategies based on Explanation-Based Learning.

    3. Tuning techniques for cases where performance depends on the input data. We have built a generator of sorting routines and shown that our approach produces what appears to be the fastest available routine for sorting arrays of integers (a toy sketch of such input-adaptive dispatch follows the list).
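
For the third topic, the essential idea is that the generated routine inspects its input cheaply and dispatches to whichever algorithm offline tuning found best for inputs with those characteristics. A minimal sketch, with made-up thresholds and only two algorithm choices:

    import random

    def insertion_sort(xs):
        # Fast on small or nearly sorted inputs.
        for i in range(1, len(xs)):
            key, j = xs[i], i - 1
            while j >= 0 and xs[j] > key:
                xs[j + 1] = xs[j]
                j -= 1
            xs[j + 1] = key
        return xs

    def adaptive_sort(xs):
        # Dispatch on cheap input properties -- the kind of runtime test
        # an input-aware generator can emit.  Thresholds are invented.
        if len(xs) <= 32:
            return insertion_sort(xs)
        k = min(64, len(xs) - 1)
        sampled = random.sample(range(len(xs) - 1), k)
        inversions = sum(1 for i in sampled if xs[i] > xs[i + 1])
        if inversions <= k // 16:      # input looks nearly sorted
            return insertion_sort(xs)
        return sorted(xs)              # general-purpose fallback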

Using Direct-search in Automatic Tuning of Applications

Apan Qasem, Ken Kennedy, John Mellor-Crummey (Rice)

Slides: PDF or PPT

Exploring the large and complex transformation search space is one of the main obstacles in developing efficient and practical tools for automatic tuning of whole applications. To address this issue, we have developed a prototype tool that uses loop-level performance feedback and a direct-search strategy to effectively explore the optimization search space. In our talk, we will give an overview of our autotuning framework and present results from experiments using our search strategy.
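
As background on the shape of the technique (a sketch, not the Rice tool itself): a compass-style direct search probes one step up and down along each parameter axis, moves greedily when a neighbor improves the timing, and halves its steps otherwise. The measure callback, which would compile and time the program at the given point, is left abstract:

    def direct_search(measure, start, steps=(8, 2)):
        # Compass-style direct search over an integer parameter space,
        # e.g. (tile size, unroll factor).  Needs only comparisons of
        # timings, no derivatives or analytical model.
        point, best = tuple(start), measure(tuple(start))
        step = list(steps)
        while True:
            improved = False
            for dim in range(len(point)):
                for delta in (step[dim], -step[dim]):
                    cand = list(point)
                    cand[dim] = max(1, cand[dim] + delta)
                    if cand[dim] == point[dim]:
                        continue
                    t = measure(tuple(cand))
                    if t < best:
                        point, best, improved = tuple(cand), t, True
            if not improved:
                if all(s == 1 for s in step):
                    return point, best
                step = [max(1, s // 2) for s in step]

    # Hypothetical use: (tile, unroll), secs = direct_search(time_compiled, (32, 4))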

ROSE: Source-to-Source Analysis and Optimization

Dan Quinlan (LLNL)

Slides: PDF or PPT

ROSE is an open source tool for the optimization of C and C++ scientific applications.  Specifically, ROSE is a C++ library for building source-to-source translators and analysis tools for large scale DOE scientific applications.  ROSE provides a simple object-oriented compiler infrastructure targeted at a general non-compiler audience.  Our focus is on the optimization of scientific applications, but ROSE can be, and has been, used by others to develop highly specialized source-based analysis tools. A specific focus within our project is on robustness and loop optimization, so that large scale DOE applications can be optimized. Recent work has focused on the use of ROSE as a basis for empirical optimization (automatic tuning).  Work with Rice has also added initial Fortran support to ROSE.

Adaptive Inlining

Todd Waterman and Keith Cooper (Rice)

Slides: PDF or PPT

Procedure inlining is a complex optimization that has been the subject of significant research over the years, but prior techniques have all had limited success. Adaptive techniques have recently emerged as a method for improving compiler performance, with the primary focus to date on optimization order. We present an adaptive inlining system that finds a program-specific set of inlining decisions, resulting in consistently better program performance than a static inlining technique can achieve.
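
As a rough sketch of the adaptive idea, the fragment below searches over a compiler's inlining budget by recompiling and timing the program. GCC's -finline-limit flag and the file names are stand-ins for whatever knobs the actual system controls; a single global threshold is also a much coarser knob than the per-decision control the talk describes:

    import subprocess
    import time

    def build_and_time(limit, src="app.c", exe="./app"):
        # Recompile with a given inlining budget, then time one run.
        # (src and exe are hypothetical names.)
        subprocess.run(["gcc", "-O2", f"-finline-limit={limit}",
                        "-o", exe, src], check=True)
        t0 = time.perf_counter()
        subprocess.run([exe], check=True)
        return time.perf_counter() - t0

    def adapt_inlining(limits=(50, 100, 200, 400, 800)):
        # Keep the program-specific budget that runs fastest.
        return min(limits, key=build_and_time)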

Tuning High Performance Kernels through Empirical Compilation

Clint Whaley (University of Texas, San Antonio)

Slides: PDF

There are a few application areas that remain almost untouched by the historical and continuing advancement of compilation research. At the extremes of optimization required for high performance computing on one end of the spectrum and embedded systems on the other, many critical routines are still hand-tuned, often directly in assembly. At the same time, architecture implementations are performing an increasing number of compiler-like transformations in hardware, making it harder to predict the performance impact of a given series of optimizations applied at the ISA level. These issues, together with the rate of hardware evolution dictated by Moore's Law, make it almost impossible to keep key kernels running at peak efficiency. Automated empirical systems, where direct timings are used to guide optimization, have provided the most successful response to these challenges. This talk describes our approach to performing empirical optimization, which utilizes a low-level iterative compilation framework specialized for optimizing high performance computing kernels. We present results showing that this approach can not only provide speedups over traditional optimizing compilers, but can also improve overall performance compared to the best hand-tuned kernels selected by the empirical search of our well-known ATLAS package.

Automatic Tuning of Sparse Matrix Kernels

Kathy Yelick (U.C. Berkeley and LBNL)

Slides: PDF

The Optimized Sparse Kernel Interface (OSKI) Library is a collection of automatically tuned computational kernels for sparse matrices, which is designed for use in solver libraries and applications. OSKI has a BLAS-style interface, providing basic kernels like sparse matrix-vector multiply and sparse triangular solve, among others. OSKI contains a set of optimizations such as data structure reorganizations that are specific to the matrix structure.  The optimizations currently target memory hierarchies on cache-based scalar processors, although work on vector processor and SMP optimizations is ongoing.  I will describe the optimizations and how the OSKI interface is designed to allow for performance information from offline analysis of the hardware performance, from user hints, and from runtime feedback.
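
For reference, the kernel at the center of such a library looks like the unblocked compressed-sparse-row (CSR) matrix-vector multiply sketched below; an OSKI-style tuner would reorganize the data structure (for example, into register-blocked form) based on the matrix's nonzero pattern and the machine. This sketch is illustrative only and is not OSKI's actual interface:

    def spmv_csr(n, rowptr, colind, vals, x, y):
        # y = A*x for an n-row matrix in CSR form.  A tuned variant
        # would instead walk a blocked structure chosen per matrix
        # and per machine; this reference version is unblocked.
        for i in range(n):
            acc = 0.0
            for k in range(rowptr[i], rowptr[i + 1]):
                acc += vals[k] * x[colind[k]]
            y[i] = acc
        return y

    # Tiny example: the 2x2 matrix [[2, 0], [1, 3]].
    rowptr, colind, vals = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
    y = spmv_csr(2, rowptr, colind, vals, [1.0, 1.0], [0.0, 0.0])
    # y == [2.0, 4.0]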

This work is joint with Jim Demmel, Rich Vuduc, and other members of the Berkeley Benchmarking and Optimization (BeBOP) group.

Parameterization of Compiler Optimizations for Empirical Tuning

Qing Yi (University of Texas, San Antonio)

Slides: PDF

Conventional compiler optimizations are based on static approximations of machine behavior, which have been shown to be inadequate in many cases. Empirical tuning can compensate for the inaccuracies of static performance models by selecting optimizations based on actual runtime information. However, since conventional compilers provide very limited ways to adjust their application of optimizations, previous research has focused on tuning a very small subset of optimization opportunities, such as loop blocking and unrolling.

We believe that for empirical tuning to be successful, it needs to be able to fully control the application of compiler optimizations. We propose a framework in which compilers systematically parameterize their optimizations and produce an intermediate form that encodes the optimizations with an explicit search space. A separate code generator can then be invoked by the empirical tuner to generate various versions of the optimized code. Because the empirical tuner has complete freedom to navigate the entire optimization search space available to a compiler, it will be much more effective in finding the best solution.

This work, a collaboration with several research groups, is currently ongoing. I will present some preliminary results, focusing on the combination of various loop optimizations, including loop blocking, fusion, and unrolling.
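
A toy sketch of the code-generator half of such a framework: given the parameter values chosen by the tuner, it emits one concrete version of an optimized loop. Here a one-dimensional copy loop is blocked and unrolled; real frameworks parameterize full loop nests and many more transformations:

    def gen_blocked_copy(n, block, unroll):
        # Emit C source for a blocked, unrolled copy loop; the empirical
        # tuner compiles and times one such version per search point.
        assert n % block == 0 and block % unroll == 0
        body = "\n".join(
            f"      dst[i + j + {u}] = src[i + j + {u}];"
            for u in range(unroll))
        return (f"void copy(double *dst, const double *src) {{\n"
                f"  for (int i = 0; i < {n}; i += {block})\n"
                f"    for (int j = 0; j < {block}; j += {unroll}) {{\n"
                f"{body}\n"
                f"    }}\n"
                f"}}\n")

    print(gen_blocked_copy(1024, 64, 4))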

How Oblivious can Cache-oblivious Codes be?

Kamen Yotov (Cornell)

Slides: PDF or PPT

Cache-oblivious algorithms provide a solution to the problem of writing programs that adapt automatically to memory hierarchies to optimize their performance. These algorithms, which are based on the divide-and-conquer paradigm, enjoy certain important theoretical properties such as I/O optimality, but there are few head-to-head comparisons of the experimental performance of cache-oblivious and cache-aware programs.

This talk describes such a study for matrix multiplication. Starting from code that is completely oblivious to the machine architecture, we successively add "awareness" of different architectural features by optimizing the code to take advantage of those features, until we arrive at a completely cache- and architecture-aware code. Our experiments show that obliviousness carries a significant penalty on current architectures, and that it will not be easy to eliminate this penalty.
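
For readers unfamiliar with the divide-and-conquer formulation, here is a minimal cache-oblivious matrix multiply (a textbook sketch, not the study's code, and assuming n is a power of two). Note that no cache parameter appears anywhere; the cache-aware versions in the study instead pick machine-specific tile sizes:

    def matmul_oblivious(A, B, C, i0=0, j0=0, k0=0, n=None, cutoff=16):
        # C[i0:i0+n, j0:j0+n] += A[i0:i0+n, k0:k0+n] * B[k0:k0+n, j0:j0+n]
        # via eightfold recursive splitting; n must be a power of two.
        if n is None:
            n = len(A)
        if n <= cutoff:
            # Base case: a plain triple loop on a tile that, at some
            # recursion depth, fits in every level of the hierarchy.
            for i in range(i0, i0 + n):
                for k in range(k0, k0 + n):
                    a = A[i][k]
                    for j in range(j0, j0 + n):
                        C[i][j] += a * B[k][j]
            return
        h = n // 2
        for di in (0, h):
            for dj in (0, h):
                for dk in (0, h):
                    matmul_oblivious(A, B, C, i0 + di, j0 + dj, k0 + dk, h, cutoff)

    # Usage: matmul_oblivious(A, B, C) on square power-of-two lists of lists.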