Ken Kennedy

Technical Biography

November 2005

Ken Kennedy attended Rice University, receiving a B.A. in mathematics (summa cum laude) in 1967. He pursued graduate studies at New York University, where he earned an M.S. in mathematics in 1969 and a Ph.D. in computer science in 1971. He returned to Rice University in 1971 to join the faculty of the Mathematical Sciences Department, rising to the rank of professor in 1980. He founded the Rice Computer Science Department in 1984 and served as its chair until 1988. He was named the Noah Harding Professor of Computer Science in 1985. In 1997, he became the first John and Ann Doerr Professor of Computational Engineering and, in 2002, he was promoted to University Professor.

In his 34-year faculty career, Kennedy has directed 36 Ph.D. dissertations and published over 205 technical articles. The following paragraphs describe some of the most important technical accomplishments during that period.

Data Flow Analysis. Throughout his career, Ken Kennedy has conducted research on the optimization of code compiled from high level languages, especially Fortran. From 1970 to 1978, he worked on methods for global data flow analysis in support of code optimizations, contributing widely used approaches for live analysis and reduction in strength (with Cocke and Allen). He also invented the “node listing” iterative method for data flow analysis, the only iterative approach to achieve near-linear worst-case running times. With his student Linda Zucconi, he developed a general strategy for using graph grammars to generate linear-time algorithms for data flow on structured languages.
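
As a concrete illustration of the iterative style of data flow analysis, the following is a minimal sketch of live-variable analysis on a hypothetical three-block flow graph; the blocks, use/def sets, and bitset encoding are invented for the example and are not taken from Kennedy's papers.

    #include <stdio.h>

    /* Iterative live-variable analysis on a tiny, hypothetical flow graph.
     * Variables a,b,c,d are bits 0..3.  Blocks: B0 -> B1; B1 -> B1, B2.
     * Equations:  in[B]  = use[B] | (out[B] & ~def[B])
     *             out[B] = union of in[S] over the successors S of B.
     * Iterate until no set changes (a fixed point is reached). */

    #define NBLOCKS 3
    typedef unsigned set;                      /* one bit per variable */

    int main(void) {
        set use[NBLOCKS] = { 1u << 0,                 /* B0 uses a       */
                             (1u << 0) | (1u << 1),   /* B1 uses a, b    */
                             1u << 2 };               /* B2 uses c       */
        set def[NBLOCKS] = { (1u << 1) | (1u << 2),   /* B0 defines b, c */
                             1u << 2,                 /* B1 defines c    */
                             1u << 3 };               /* B2 defines d    */
        int nsucc[NBLOCKS]   = { 1, 2, 0 };
        int succ[NBLOCKS][2] = { {1, 0}, {1, 2}, {0, 0} };

        set in[NBLOCKS] = {0}, out[NBLOCKS] = {0};
        int changed = 1;
        while (changed) {                      /* round-robin iteration */
            changed = 0;
            for (int b = NBLOCKS - 1; b >= 0; b--) {
                set o = 0;
                for (int s = 0; s < nsucc[b]; s++)
                    o |= in[succ[b][s]];
                set i = use[b] | (o & ~def[b]);
                if (i != in[b] || o != out[b]) {
                    in[b] = i;  out[b] = o;  changed = 1;
                }
            }
        }
        for (int b = 0; b < NBLOCKS; b++)
            printf("B%d: live-in = %#x, live-out = %#x\n", b, in[b], out[b]);
        return 0;
    }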

Attribute Grammars. With his first student, Scott Warren, Kennedy published a seminal paper on automatic generation of attribute grammar evaluators. This work defined the absolutely non-circular attribute grammars, the largest known class of attribute grammars that can be tested for circularity in polynomial time, and developed a generation algorithm for this class.

Vectorization. In 1978 and 1979, he spent a year at IBM Research where he began work on automatic vectorization. This research, conducted jointly with his student Randy Allen, led to the development of PFC, one of the earliest and most successful automatic vectorization systems for Fortran. Among the techniques pioneered in PFC are:

  1. The notion of loop-carried dependence and its use in the Allen-Kennedy multilevel code generation algorithm, the most widely used vector code generation procedure (a sketch appears after this list);

  2. If conversion, the standard technique for vectorizing loops that contain conditionals;

  3. Fast multilevel dependence testing algorithms;

  4. Fast loop interchange techniques; and

  5. New strategies for improving vector register usage.
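
As an illustration of item 1, the sketch below (in C rather than Fortran, with invented array names) shows how loop-carried dependence drives code generation: the statement that carries a dependence must remain in a sequential loop, while the independent statement can be distributed into a loop of its own and vectorized.

    /* Original loop: S1 carries a dependence (a[i] uses a[i-1] from the
     * previous iteration); S2 depends only on values from the same
     * iteration.  Hypothetical example, not code from PFC. */
    void loop_orig(double *a, const double *b, double *c, int n) {
        for (int i = 1; i < n; i++) {
            a[i] = a[i - 1] + b[i];    /* S1: loop-carried dependence */
            c[i] = a[i] * 2.0;         /* S2: no carried dependence   */
        }
    }

    /* After loop distribution in the spirit of the Allen-Kennedy code
     * generation algorithm: S1 stays in a sequential recurrence loop,
     * while S2 becomes a fully parallel loop that can be vectorized. */
    void loop_distributed(double *a, const double *b, double *c, int n) {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + b[i];    /* sequential */
        for (int i = 1; i < n; i++)
            c[i] = a[i] * 2.0;         /* vectorizable */
    }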

PFC was a model for vectorization in the IBM VS Fortran compiler, and it influenced the design of most other commercial vectorization systems.

The work on vectorization has also had a strong impact on instruction scheduling for uniprocessors, because vectorization exposes more operations that can be issued simultaneously on a superscalar or VLIW architecture. In addition, if conversion is the strategy of choice for converting instruction sequences to use hardware support for predicated execution, a feature of the Intel IA-64 architecture.
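
A minimal sketch of if conversion, again with invented names: the control dependence on the test becomes a guard (predicate) on the assignment, so every iteration executes the same straight-line code, which a vectorizer can handle directly or which maps onto hardware predication.

    /* Conditional loop before if conversion (hypothetical example). */
    void cond_orig(const double *a, double *b, int n) {
        for (int i = 0; i < n; i++)
            if (a[i] > 0.0)
                b[i] = b[i] + a[i];
    }

    /* After if conversion: the branch becomes a predicate guarding the
     * update; control dependence has been converted to data dependence. */
    void cond_ifconverted(const double *a, double *b, int n) {
        for (int i = 0; i < n; i++) {
            int p = (a[i] > 0.0);               /* predicate */
            b[i] = p ? b[i] + a[i] : b[i];      /* guarded assignment */
        }
    }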

Implementation of Fortran 90. As a part of the PFC vectorization effort, Kennedy and Allen developed strategies for scalarization of array assignments, by which such assignments are converted to scalar loops (or loops containing fixed-length vector operations of the type found on most vector computers). They developed a number of transformations to minimize the amount of temporary storage (and copying) required for a correct implementation, including loop reversal and input prefetching. Subsequently, Kennedy and his student Yuan Zhao extended this work to achieve near-optimal performance by incorporating loop alignment (sometimes called loop shifting) and loop skewing, proving that the resulting scalarization was within a constant factor of the best possible.
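
The scalarization problem and the loop-reversal fix can be sketched in C with an invented array: Fortran 90 semantics require the entire right-hand side of A(2:N) = A(1:N-1) to be read before any element is stored, so a naive forward scalar loop is incorrect, but running the loop in reverse preserves the semantics without a temporary array.

    /* Naive scalarization of the Fortran 90 assignment A(2:N) = A(1:N-1):
     * WRONG, because the value of a[0] propagates through the whole array. */
    void scalarize_naive(double *a, int n) {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1];
    }

    /* Loop reversal: every a[i-1] is read before it is overwritten, so the
     * result matches the array-assignment semantics with no temporary. */
    void scalarize_reversed(double *a, int n) {
        for (int i = n - 1; i >= 1; i--)
            a[i] = a[i - 1];
    }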

Scalar Register Allocation. In collaboration with his colleagues Keith Cooper and Linda Torczon and his student Preston Briggs, whom he co-advised with Cooper, Kennedy was a co-inventor of the Chaitin-Briggs coloring-based register allocator, a major advance over previous coloring heuristics in that more interference graphs could be colored without spilling. This patented allocation strategy is now used in many commercial compilers and in the GNU suite of open-source compilers.
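
The key difference between the Chaitin and Briggs allocators can be sketched on a small, hypothetical interference graph: a cycle of four virtual registers with two physical registers available. Chaitin's simplify phase finds no node of degree less than k and marks one for spilling; the Briggs allocator pushes such a node optimistically and frequently discovers a free color for it during the select phase, as it does here. The graph, register names, and output format are invented for the example.

    #include <stdio.h>
    #include <string.h>

    #define N 4    /* virtual registers r0..r3 */
    #define K 2    /* physical registers R0, R1 */

    /* Interference graph: the 4-cycle r0-r1-r2-r3-r0. */
    static int adj[N][N] = {
        {0, 1, 0, 1},
        {1, 0, 1, 0},
        {0, 1, 0, 1},
        {1, 0, 1, 0},
    };

    int main(void) {
        int removed[N] = {0}, stack[N], top = 0;

        /* Simplify: remove nodes of degree < K; when none remains, Chaitin
         * would spill, but Briggs pushes a spill candidate optimistically. */
        while (top < N) {
            int picked = -1;
            for (int v = 0; v < N && picked < 0; v++) {
                if (removed[v]) continue;
                int deg = 0;
                for (int w = 0; w < N; w++)
                    if (!removed[w] && adj[v][w]) deg++;
                if (deg < K) picked = v;
            }
            if (picked < 0) {                       /* no low-degree node */
                for (int v = 0; v < N; v++)
                    if (!removed[v]) { picked = v; break; }
                printf("r%d pushed as an optimistic spill candidate\n", picked);
            }
            removed[picked] = 1;
            stack[top++] = picked;
        }

        /* Select: pop each node and give it the first color not used by an
         * already-colored neighbor; a candidate with no free color would
         * actually be spilled. */
        int color[N];
        memset(color, -1, sizeof color);
        while (top > 0) {
            int v = stack[--top];
            int used[K] = {0};
            for (int w = 0; w < N; w++)
                if (adj[v][w] && color[w] >= 0) used[color[w]] = 1;
            for (int c = 0; c < K; c++)
                if (!used[c]) { color[v] = c; break; }
            printf("r%d -> %s\n", v, color[v] < 0 ? "spilled"
                                                  : (color[v] ? "R1" : "R0"));
        }
        return 0;
    }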

Memory Hierarchy Management. Building on vector register optimization techniques, Kennedy and his student Steve Carr applied the techniques of vectorization to improve the use of the memory hierarchy on conventional uniprocessors. These techniques, scalar replacement and unroll-and-jam (sometimes called register blocking), were developed in collaboration with John Cocke and David Callahan. Taken together, they dramatically improved the allocation of subscripted variables to processor registers. This work has influenced the design of most optimizing compilers for scalar machines. In addition, scalar replacement is being incorporated into the GNU 4.0 compilers.
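
Both transformations can be sketched on a hypothetical matrix-multiply kernel: scalar replacement keeps the subscripted value c[i][j] in a scalar, and therefore in a register, across the inner loop, and unroll-and-jam unrolls the j loop by two and fuses the copies so that each load of a[i][k] feeds two multiply-adds (for brevity, n is assumed to be even).

    /* Original kernel: c[i][j] is loaded and stored on every iteration of
     * the k loop, and each load of a[i][k] feeds only one multiply. */
    void matmul(int n, double a[n][n], double b[n][n], double c[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* After scalar replacement and unroll-and-jam (j unrolled by 2). */
    void matmul_opt(int n, double a[n][n], double b[n][n], double c[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j += 2) {
                double c0 = c[i][j], c1 = c[i][j + 1];   /* scalar replacement */
                for (int k = 0; k < n; k++) {
                    double aik = a[i][k];                /* one load, two uses */
                    c0 += aik * b[k][j];
                    c1 += aik * b[k][j + 1];
                }
                c[i][j] = c0;  c[i][j + 1] = c1;
            }
    }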

With David Callahan and Alan Porterfield, he also developed standard compiler strategies for cache management, including work with Porterfield on cache blocking with alignment, and one of the seminal papers on software prefetching of cache blocks, which established that the technique could be extremely effective in practice.
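
Software prefetching can be sketched with the GCC/Clang __builtin_prefetch intrinsic and an invented prefetch distance; a compiler applying the technique would derive the distance from estimated memory latency and the cost of one loop iteration.

    /* Sum of an array, with each element requested a fixed distance ahead
     * of its use.  DIST is a hypothetical, machine-dependent parameter. */
    double sum_with_prefetch(const double *x, long n) {
        const long DIST = 64;                   /* in elements, not bytes */
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&x[i + DIST], 0 /* read */, 1 /* low locality */);
            s += x[i];
        }
        return s;
    }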

In recent work with Chen Ding, John Mellor-Crummey, and David Whalley, he has extended memory hierarchy optimization to irregular computations. A particularly important result of this work is a new strategy for multilevel blocking on irregular problems.

Whole-Program Analysis. In the early 1980s Kennedy began to extend his methods for vectorization to automatic parallelization. He discovered that the key impediment to compiler parallelization was the limitation of global analysis techniques to single procedures. To overcome this limitation, he began work on the Rn programming environment, which was designed to support whole-program analysis in an environment that permitted separate compilation. The work on Rn led to the development of a number of new algorithms for interprocedural data flow analysis, including:

  1. A linear-time algorithm for flow-insensitive side-effect problems, which was developed with Keith Cooper;

  2. A linear-time algorithm for alias analysis, also with Cooper;

  3. An algorithm for interprocedural constant propagation, with David Callahan, Keith Cooper and Linda Torczon; and

  4. New algorithms for call-graph construction, with Mary Hall.

A major side effect of this work was the introduction of procedure cloning to improve interprocedural optimization. Cloning makes it possible to gain many of the benefits of inlining, such as compiling the code in the context of the call site, without actual textual substitution, which complicates local compilation.
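
Procedure cloning can be sketched with a hypothetical routine: interprocedural constant propagation shows that some call sites always pass a stride of one, so the compiler creates a clone compiled in that context. Those call sites get the benefit of the constant (here, a unit-stride loop that is easy to vectorize) without the textual substitution of inlining.

    /* General routine: the unknown stride blocks unit-stride optimization. */
    void daxpy_strided(long n, double alpha,
                       const double *x, double *y, long stride) {
        for (long i = 0; i < n; i++)
            y[i * stride] += alpha * x[i * stride];
    }

    /* Clone generated for call sites known, interprocedurally, to pass
     * stride == 1: the constant is folded into the body, as inlining would
     * do, but the routine remains a single separately compiled procedure. */
    void daxpy_stride1(long n, double alpha, const double *x, double *y) {
        for (long i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }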

In addition, the Rn project pioneered a general strategy for whole-program compilation that used a new tool, called a program compiler, to manage the process. The program compiler is responsible for computing all interprocedural data flow information for the program and for invoking the single-module compiler on those program files that must be recompiled. To minimize the amount of recompilation after a change, the program compiler carries out a detailed analysis of the use of interprocedural information in each program file. This approach, developed jointly with Michael Burke, Keith Cooper, Mary Hall and Linda Torczon, was adopted in the Convex Applications Compiler, the first commercial compiler to perform interprocedural optimization on multi-file Fortran programs.

Parallelization. Building on the interprocedural analysis framework, Kennedy’s group extended the methods of vectorization to automatic parallelization. This work concentrated on the use of dependence to uncover opportunities for parallelism in shared memory multiprocessors. The parallelization research contributed a number of new algorithms, including:

  1. An algorithm for parallel code generation based on loop distribution, loop fusion, loop alignment and code replication, with Randy Allen and David Callahan;

  2. A linear-time algorithm for typed fusion, which maximally fuses all loops of the same “type” (parallel versus sequential, matching loop bounds, etc.) in an unweighted graph (a sketch follows below); and

  3. An algorithm for loop distribution with control dependence that makes it simple to parallelize loops with conditionals, with Kathryn McKinley.

The latter effort solved an important technical subproblem, namely how to generate code from control dependences after numerous program transformations. This work made it possible to map any program dependence graph back to a legal Fortran program.
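
Typed fusion (item 2 above) can be sketched on two hypothetical loops of the same type: both are parallel and have matching bounds, so they can be fused into one parallel loop, which reduces loop overhead and reuses a[i] and b[i] immediately; a loop of a different type, for example one carrying a dependence, would be left outside the fused group.

    /* Two parallel loops of the same "type": matching bounds and no
     * dependence carried between iterations (hypothetical example). */
    void unfused(const double *a, double *b, double *c, int n) {
        for (int i = 0; i < n; i++) b[i] = a[i] + 1.0;
        for (int i = 0; i < n; i++) c[i] = a[i] * b[i];
    }

    /* After typed fusion: a single parallel loop; each b[i] is used
     * immediately after it is produced, and a[] is traversed only once. */
    void fused(const double *a, double *b, double *c, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = a[i] + 1.0;
            c[i] = a[i] * b[i];
        }
    }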

Interactive Parallelization. Kennedy's research on parallelization also led to the development of a new kind of tool that used static analysis to provide insights into why a program would not parallelize. The PTOOL browser, based on PFC, was the first such tool. In the late 1980s, Kennedy's group produced the ParaScope Editor, an advanced editor for Fortran 77 that incorporated these capabilities. The ParaScope research also led to the development of regular section analysis (with students David Callahan, Vasanth Bala, and Paul Havlak), the most common approach to interprocedural analysis of side effects on array subsections. ParaScope was widely distributed, and it is the basis of at least one commercial parallelization system, a Siemens product. It has influenced the design of numerous others, notably the Forge system from Applied Parallel Research.

CRPC. In 1988, Kennedy led a group of researchers from Caltech, Rice, Argonne National Laboratory, and Los Alamos National Laboratory in a proposal to the National Science Foundation to create the Center for Research on Parallel Computation (CRPC), one of the first NSF Science and Technology Centers. He served as director of the CRPC from its inception in 1989 to its conclusion in 2000. In addition to its numerous technical accomplishments, the CRPC convened influential standardization efforts, including those for HPF (see below) and MPI.

High Performance Fortran. The original goal of the CRPC was to make parallel computers usable by ordinary scientists and engineers. A major impediment to reaching that goal was the lack of effective machine-independent parallel programming interfaces. To address that problem, Kennedy's group at Rice and Geoffrey Fox proposed Fortran D, an extended version of Fortran that permits the specification of data distributions for arrays across the processors of a parallel machine. With Seema Hiranandani and Chau-Wen Tseng, Kennedy developed a prototype compiler for Fortran D that established the feasibility of achieving high performance on data parallel programs. This led to the establishment, under his leadership, of the High Performance Fortran Forum, a broad-based consortium to develop extensions to Fortran 90 aimed at high performance on parallel machines. The resulting standard for High Performance Fortran (HPF) was initially well received in the HPCC community: at its peak, HPF was offered as a product by fifteen companies and was used as the implementation language for more than forty major applications, including some of more than 100,000 lines. Although its popularity in the United States has diminished, it has been used successfully for several applications on the Japanese Earth Simulator. One such application achieved a speed of nearly 16 teraflops, roughly 40 percent of the peak speed of the machine. In addition, the HPF standard has been extremely influential in the design of other languages. The standard document itself is one of the top 25 most cited papers in computer science, and all three of the language designs for the High Productivity Computing Systems program include distribution-based features inspired in part by HPF.

The work on HPF led to two other efforts aimed at supporting technologies for scalable parallel computing.

  1. Development of the D System, a suite of tools based on ParaScope for the support of HPF programming (with Vikram Adve, Alan Carle, Chuck Koelbel, John Mellor-Crummey, Dan Reed and Scott Warren). This effort pioneered new ways for the compiler and performance tools to collaborate to display useful information about program performance in terms of the original source code. The principal strategy is for the compiler to record detailed mappings from the source to the generated code and to pass these mappings to the performance display system.

  2. Compilation technology to support the solution of problems on irregular meshes (with Geoffrey Fox, Chuck Koelbel, John Mellor-Crummey, Joel Saltz, and David Whalley). This work concentrated on high-level strategies to generate run-time analysis of communication patterns and memory hierarchy usage that can dramatically improve performance in the main computation.

Grid Application Development Software (GrADS) Project. This multi-institutional research effort (with Fran Berman, Andrew Chien, Keith Cooper, Jack Dongarra, Dennis Gannon, Ian Foster, Lennart Johnsson, Carl Kesselman, Dan Reed, and Linda Torczon, and funded by NSF NGS) is exploring compilation and system technologies to support programming for heterogeneous distributed computing platforms. This effort has developed a new vision for program preparation and execution that helps applications achieve reliable performance in the presence of dynamic changes in the computation and communication resources available to them.

GrADS produced a coherent vision for an execution environment and application development tools that should make it far easier to produce grid-ready applications in the future. Two key concepts were central to this vision. First, an application is encapsulated as a configurable object program (COP), which can be optimized rapidly for execution on a specific collection of Grid resources. Second, the system relies upon performance contracts that it uses to validate expected application performance against observed behavior on available resources. Implementations of these ideas were developed and tested using several applications that ran on the GrADS “MacroGrid” testbed, which consisted of geographically distributed Linux clusters with GrADS software preinstalled.

The importance of the GrADS work was recognized by the award for the “Virtual Grid Application Development System Project (VGRADS),” an NSF Information Technology Research (Large) grant to extend these ideas to “virtual grids.” The principal advance beyond GrADS has been the development of fast strategies to deliver virtual grids to applications; these virtual grids would be culled from the global space of available resources based on abstract specifications. The project has also produced new scheduling strategies, based on automatically constructed performance models, that speed up the completion time of multi-step applications on the delivered virtual grids. The VGrADS tools have been used to produce a Grid implementation of EMAN, a major image reconstruction algorithm from the National Center for Macromolecular Imaging (NCMI), and are being employed in the NSF-sponsored LEAD project for mesoscale weather modeling.

Los Alamos Computer Science Institute (LACSI). In a major collaboration with the Los Alamos National Laboratory, Kennedy and his collaborators have been developing new strategies for improving memory hierarchy management in scalable parallel computing systems. This work has focused on the use of aggressive loop fusion to reduce main-memory bandwidth requirements in scientific applications. New results have included a fast heuristic algorithm for global weighted loop fusion that permits dynamic recomputation of weights and takes limited resources into account. In addition, Kennedy and Ding have explored the concept of reuse distance to guide computation reorganization to improve cache utilization. Finally, they have developed a very successful data reorganization transformation to further improve the utilization of limited memory bandwidth.
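
The reuse-distance idea can be sketched on an invented reference trace: the reuse distance of an access is the number of distinct data elements touched since the previous access to the same element, and a fully associative LRU cache of C elements hits exactly when that distance is less than C.

    #include <stdio.h>

    /* Reuse distances for a hypothetical trace of one-character "addresses".
     * A quadratic scan is used for clarity; practical tools use a tree. */
    int main(void) {
        const char trace[] = "abcacb";
        int n = (int)sizeof trace - 1;
        for (int i = 0; i < n; i++) {
            int prev = -1;
            for (int j = i - 1; j >= 0; j--)
                if (trace[j] == trace[i]) { prev = j; break; }
            if (prev < 0) { printf("%c: first use\n", trace[i]); continue; }
            int seen[256] = {0}, distinct = 0;
            for (int j = prev + 1; j < i; j++)
                if (!seen[(unsigned char)trace[j]]++) distinct++;
            printf("%c: reuse distance %d\n", trace[i], distinct);
        }
        return 0;
    }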

Another critical line of work has been the development of advanced blocking strategies for scientific codes. Kennedy and his student Qing Yi developed a recursive cache-oblivious approach (with Vikram Adve) for blocking scientific loops. In addition, they produced a new transformation, called dependence hoisting, that can be used to block LU decomposition with partial pivoting and QR decomposition automatically, achieving performance comparable to the hand blocking used in LAPACK.
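
The cache-oblivious idea can be sketched on a simpler kernel than LU, a matrix transpose with an invented tile cutoff: the recursion halves the larger dimension until the subproblem fits in whatever level of cache is currently active, so no cache size appears anywhere in the code.

    /* Cache-oblivious out-of-place transpose of an n x n row-major matrix:
     * recursively split the larger dimension; the base case is a small tile. */
    static void transpose_rec(const double *a, double *b,
                              int r0, int r1, int c0, int c1, int n) {
        if (r1 - r0 <= 16 && c1 - c0 <= 16) {
            for (int i = r0; i < r1; i++)
                for (int j = c0; j < c1; j++)
                    b[(long)j * n + i] = a[(long)i * n + j];
        } else if (r1 - r0 >= c1 - c0) {
            int rm = (r0 + r1) / 2;                  /* split the rows */
            transpose_rec(a, b, r0, rm, c0, c1, n);
            transpose_rec(a, b, rm, r1, c0, c1, n);
        } else {
            int cm = (c0 + c1) / 2;                  /* split the columns */
            transpose_rec(a, b, r0, r1, c0, cm, n);
            transpose_rec(a, b, r0, r1, cm, c1, n);
        }
    }

    void transpose(const double *a, double *b, int n) {
        transpose_rec(a, b, 0, n, 0, n, n);
    }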

The LACSI project is also exploring automatic tuning of applications and component libraries for new architectures through the use of parameterized trial executions and heuristic search. Kennedy and his student Apan Qasem have developed a tuning system for general applications that employs register and cache blocking with loop fusion. This system has been successfully applied to speed up a number of well-known compact benchmarks.

Java Compilation. With his former student Zoran Budimlic, Kennedy has been investigating new compilation strategies that improve the performance of the Java programming language when it is used for high performance calculations in science and engineering. A central goal of this work is to make the use of fully polymorphic object-oriented programming strategies practical for scientific programming. To date, the effort has concentrated on static whole-program strategies for optimization that can be applied, with graceful degradation, even when parts of the program are not known. The project has developed a strategy called “object inlining” that replaces arrays of objects with arrays of their fields, substantially reducing the costs of memory operations. The JaMake transformation system, which performs object inlining, was used to optimize the Los Alamos CartaBlanca application, which won a 2005 R&D 100 Award.
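
JaMake itself operates on Java, but the effect of object inlining can be sketched in C terms as the closely analogous array-of-structures to structure-of-arrays rewrite: an array of small objects, each reached through its own reference in Java, becomes parallel arrays of fields laid out contiguously in memory. The types and names below are invented for the illustration and are not JaMake output.

    /* Before: an array of small "objects"; a loop over one field still
     * touches whole records (and, in Java, chases per-element references). */
    typedef struct { double x, y, z; } Particle;

    double sum_x_aos(const Particle *p, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) s += p[i].x;
        return s;
    }

    /* After the analog of object inlining: each field becomes its own
     * array, so a loop over one field streams through contiguous memory. */
    typedef struct { double *x, *y, *z; } ParticleFields;

    double sum_x_fields(const ParticleFields *p, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) s += p->x[i];
        return s;
    }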

Telescoping Languages. The telescoping languages project is exploring compilation support for the construction of high-performance problem-solving environments. This work explores the idea of extensive pre-analysis of the libraries that are used as the basis for high-level, domain-specific programming systems. The goal of the pre-analysis is to generate compilers for library-based problem-solving systems that achieve excellent performance in the generated code within acceptable compilation times. To accomplish this goal, the strategy is to construct translators that can optimize library calls as if they were primitives in the base language. This approach has promise for improving software productivity in scientific computing and in other domains as well. It has been used to develop (with Arun Chauhan and Cheryl McCosh) a system called LibGen that has successfully generated eight Fortran variants of Dan Sorensen’s ARPACK ArnoldiC routine from a single Matlab source version. These variants are competitive in performance with the hand-coded Fortran produced by Sorensen’s programmers. A key technical component of this system is a polynomial-time constraint-based algorithm for inferring types in Matlab functions.

Telescoping languages is also the basis for compilation of other high-level scripting languages, including the statistical language R (a project led by John Mellor-Crummey). In addition, a telescoping languages system can serve as a high-performance component integration system because the pre-analysis and optimization permit substitution of specialized functions or sequences of functions in place of a general component method invocation.

Technology Transfer and Service. In addition to his research contributions, Kennedy has led numerous technology and knowledge transfer efforts in his role as Director of the CRPC. Notable among these are the National HPCC Software Exchange, which he founded with Jack Dongarra, Geoffrey Fox, Jim Pool, and Rick Stevens, and the CRPC Retooling Project, an effort to develop educational materials that can be used by supercomputer center staff trainers to teach new concepts in parallel computation. With Rice colleague Richard Tapia, he has designed nationally recognized programs to increase the participation of minorities and women in science and engineering.

To provide accessible resources for students of advanced compilation for high-performance computers, Kennedy has published, with Randy Allen, the textbook Optimizing Compilers for Modern Architectures (Morgan Kaufmann, 2002), which covers the substance of his research on compiler-based vectorization, parallelization, and memory-hierarchy management. He also co-edited the collected volume Sourcebook for Parallel Computing (Morgan Kaufmann, 2003), which covers research materials on parallel computing software and algorithms developed during the lifetime of the Center for Research on Parallel Computation.

From 1997 to 1999 Kennedy served a two-year term as co-chair of the Presidential Information Technology Advisory Committee (PITAC), which resulted in the influential report, “Information Technology Research: Investing in Our Future.” For his leadership on PITAC, he received the Computing Research Association Distinguished Service Award (1999) and the RCI Seymour Cray HPCC Industry Recognition Award (1999).

Research Awards. In 1990, Kennedy was elected to the National Academy of Engineering. He is a Fellow of the American Association for the Advancement of Science, the Institute of Electrical and Electronics Engineers and the Association for Computing Machinery. In 1995, he received the W. W. McDowell Award, the highest research award of the IEEE Computer Society. In 1999, he was named the third recipient of the ACM SIGPLAN Programming Languages Achievement Award. He has also been honored as a distinguished alumnus of Rice University (2002), as a finalist for the ComputerWorld 21st Century Achievement Award for HiPerSoft (2004), and by election to the American Academy of Arts and Sciences (2005).

Five of Kennedy’s papers were selected to appear in the recent collection 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation (1979–1999): A Selection (K. McKinley, Editor): “Coloring Heuristics for Register Allocation” (with P. Briggs, K. Cooper and L. Torczon), “Interprocedural Side-Effect Analysis in Linear Time” (with K. Cooper), “Interprocedural Constant Propagation” (with D. Callahan, K. Cooper and L. Torczon), “Automatic Loop Interchange” (with J. Allen), and “Improving Register Allocation for Subscripted Variables” (with D. Callahan and S. Carr).