Rice University
Department of Computer Science

Nathaniel McIntosh

Thesis Defense

Compiler Support for Software Prefetching

Abstract

As the gap between processor speed and main memory speed widens, techniques that improve cache utilization and hide memory latency are becoming increasingly important. For scientific and numerical applications, which place heavy demands on memory subsystems, even small improvements in cache utilization can significantly improve performance.

Compiler-directed software prefetching is one method for improving the way programs use cache. In this form of prefetching, the compiler is enlisted to insert non-binding cache prefetch instructions into a program as it is being compiled. When the program runs, it issues prefetch operations to fetch data items into the cache before they are actually needed, effectively hiding the latency that would ordinarily be incurred due to cache misses.
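As a rough illustration of what compiler-inserted prefetching looks like in the generated code, the sketch below writes the prefetches by hand for a simple array reduction. It is not taken from the thesis: the `__builtin_prefetch` intrinsic is the GCC/Clang builtin, used here as a stand-in for a machine prefetch instruction, and the function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements. A real compiler would
   derive this from estimated miss latency and loop body cost. */
#define PREFETCH_DISTANCE 16

/* Sums a[0..n-1], issuing a prefetch for the element that will be needed
   PREFETCH_DISTANCE iterations from now. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Non-binding: the prefetch is only a hint to bring the line into
           the cache. It loads no register and does not change the result.
           Arguments: address, 0 = read access, 3 = high temporal locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);

        sum += a[i];
    }
    return sum;
}
```

Because the prefetch is non-binding, removing it leaves the computed sum unchanged; only the timing of cache misses is affected. Note that this naive version issues one prefetch per element rather than one per cache line, which is the kind of redundant prefetching that compiler analysis aims to avoid.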

This work focuses on the compiler's role in software prefetching. We introduce a set of efficiency metrics intended to characterize the contributions made by the compiler when prefetching is applied to a particular program. In a series of experimental studies, we then use these metrics to evaluate the compiler's contributions, first for sequential benchmark programs running on a simulated uniprocessor machine, and then for a set of parallel benchmarks on a simulated distributed shared memory (DSM) multiprocessor. For uniprocessor architectures, our results show that the causes of poor prefetching performance differ from program to program; there is generally no single compiler deficiency responsible. For programs running on DSM machines, the chief problem is scheduling prefetches properly. Not surprisingly, we find that a uniform-distance scheduling policy does not perform well on a non-uniform memory access machine.

Based on the results of our experiments, we propose and experimentally evaluate a set of new compiler techniques for improving software prefetching, including methods for improving prefetch scheduling, a new form of reuse analysis to reduce the frequency of useless prefetches, and a data-flow framework for gathering information about coherence activity within parallel programs running on DSM machines.
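To see why a single prefetch distance fits a non-uniform memory access machine poorly, consider a common rule of thumb for scheduling: prefetch far enough ahead that the loop executes for at least one miss latency before the data is used. The abstract does not state the scheduling formula used in the thesis, so the sketch below is only an assumed model; the function name `prefetch_distance` and all latency figures are hypothetical.

```c
#include <assert.h>

/* Assumed scheduling rule: number of loop iterations to prefetch ahead,
   computed as ceil(miss_latency_cycles / cycles_per_iter). */
unsigned prefetch_distance(unsigned miss_latency_cycles,
                           unsigned cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    /* Integer ceiling division. */
    return (miss_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}
```

Under this model, with a hypothetical loop body of 20 cycles, a 100-cycle local miss calls for a distance of 5 iterations, while a 400-cycle remote miss calls for 20. A uniform distance tuned for local misses leaves most of the remote latency exposed, and one tuned for remote misses fetches local data earlier than necessary, increasing the chance that it is evicted before use.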

Tuesday, May 13, 1997 @ 1:30 p.m.
Duncan Hall 3076