Current microprocessors improve performance by exploiting
instruction-level parallelism (ILP). ILP hardware techniques such as
multiple instruction issue, out-of-order (dynamic) issue, and
non-blocking reads can accelerate both computation and data memory
references. Since computation speeds have been improving faster than
data memory access times, memory system performance is quickly
becoming the primary obstacle to achieving high performance. (This
trend is sometimes called the "memory wall".) Pai's work focuses on
exploiting ILP techniques to improve memory system performance. In
this talk, he will present both an analysis of ILP memory system performance
and optimizations developed using the insights of this analysis.
Pai will also show that ILP hardware techniques, used in isolation, often
fail to improve memory system performance because they do not
extract parallelism among data reads that miss in the processor's
caches. Software prefetching provides some improvement by initiating
data read misses earlier, but also suffers from limitations caused by
exposed startup latencies, excessive fetch-ahead distances, and
references that are hard to prefetch.
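
As a rough illustration of the software prefetching limitations described above, the following C sketch issues a prefetch a fixed number of elements ahead of the current read. It is not taken from the talk. The function name and the PREFETCH_DISTANCE value are hypothetical, and __builtin_prefetch is a GCC/Clang builtin rather than standard C. The first PREFETCH_DISTANCE iterations have no earlier prefetch covering them, which is one way a startup latency can remain exposed.

```c
#include <stddef.h>

/* Hypothetical fetch-ahead distance, in elements. A real value would be
 * tuned to the miss latency and loop body cost; 16 is only a placeholder. */
#define PREFETCH_DISTANCE 16

/* Sums an array while requesting data PREFETCH_DISTANCE elements ahead.
 * Elements 0 .. PREFETCH_DISTANCE-1 are never prefetched, so any misses
 * on them are not hidden (the "startup" cost). */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: second argument 0 = read access,
             * third argument 0 = low expected temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 0);
        }
        sum += a[i];
    }
    return sum;
}
```

A larger distance hides more latency per miss but lengthens the unprefetched startup region and fetches data further ahead of its use, which is the tension the abstract points to.
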
Pai uses the above insights to develop compile-time software
transformations that improve memory system parallelism and
performance. These transformations improve the effectiveness of ILP
hardware, reducing exposed latency by over 80% for a latency-detection
microbenchmark and reducing execution time by an average of 25% across
the seven applications studied. They also address key
limitations of current software prefetching algorithms, reducing
execution time by an average of 24% relative to prefetching alone.
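
The abstract does not name the transformations. As a hedged sketch of the general idea of exposing more independent read misses to the hardware at once, the C example below applies unroll-and-jam, a standard loop transformation, to a row-by-row matrix sum. Whether this matches the transformations presented in the talk is an assumption, and the function names are hypothetical. In the baseline, consecutive misses in the inner loop may be a full cache line of iterations apart. In the jammed version, each inner iteration reads from four different rows, so up to four independent misses can be outstanding close together.

```c
#include <stddef.h>

/* Baseline: walks one row at a time. Within a row, a new cache line is
 * touched only every few elements, so misses are spread out in time. */
double sum_rows(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += m[i * cols + j];
    return sum;
}

/* Unroll-and-jam by 4: the outer loop is unrolled four times and the
 * copies of the inner loop are fused, so each inner iteration issues
 * four independent reads, one per row. Separate accumulators keep the
 * four additions independent of each other.
 * Note: this reorders floating-point additions, so results can differ
 * from sum_rows in the last bits for general inputs. */
double sum_rows_jammed(const double *m, size_t rows, size_t cols)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= rows; i += 4) {
        const double *r0 = m + (i + 0) * cols;
        const double *r1 = m + (i + 1) * cols;
        const double *r2 = m + (i + 2) * cols;
        const double *r3 = m + (i + 3) * cols;
        for (size_t j = 0; j < cols; j++) {
            s0 += r0[j];
            s1 += r1[j];
            s2 += r2[j];
            s3 += r3[j];
        }
    }
    /* Remainder rows when rows is not a multiple of 4. */
    for (; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s0 += m[i * cols + j];
    return (s0 + s1) + (s2 + s3);
}
```

Any benefit depends on the processor being able to keep several misses outstanding (for example through non-blocking reads), which is why a transformation of this kind complements ILP hardware rather than replacing it.
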