The fourth session consisted of two mini-panels. The first of these, chaired by Mary Baker, focused on the issue of benchmarking. Each of the three panelists, John Regehr, Jeff Mogul, and Margo Seltzer, gave a brief opening statement. After these statements, the floor was opened for questions from the audience as well as panelists.
John Regehr described the perverse effects that benchmarks can have on a system. In a case study involving a soft real-time scheduler for Windows NT, certain performance bottlenecks were discovered in device drivers. These had been introduced by the device manufacturer in an attempt to improve performance on well-publicized benchmarks, but had unfortunate side effects. Once one device manufacturer went down this path, competitors had little choice but to follow suit so that they did not suffer in benchmarks.
Jeff Mogul explored the purpose of benchmarks, and distinguished between reproducibility and predictive power. He deplored the tendency of academic researchers to focus on reproducibility, saying that it gave the illusion of scientific precision, but failed to be useful. He argued for a shift in focus to more realistic benchmarks, whose results could better predict real-world performance.
Margo Seltzer was a natural follow-on to Mogul, since she described a benchmarking methodology that addressed exactly the shortcomings he had exposed. In this methodology, overall system performance is obtained as the product of a system vector, which captures workload intensity and mix, and an application vector, which characterizes the performance of each application in isolation. The elements of the application vector are obtained through microbenchmarks. The system vector can be set to represent any existing or anticipated workload.
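As a rough illustration of the vector product described above, the following Python sketch treats both vectors as plain lists and combines them with a dot product. The function name, the three applications, the units, and all numbers are hypothetical and chosen only for the example; the panel summary does not specify how the vectors are encoded, so this should be read as one possible interpretation rather than as Seltzer's implementation.

```python
# Hypothetical sketch of the vector-product idea; names and numbers are invented.

def predicted_performance(system_vector, application_vector):
    """Combine a workload mix with per-application costs via a dot product."""
    if len(system_vector) != len(application_vector):
        raise ValueError("vectors must have the same length")
    return sum(w * p for w, p in zip(system_vector, application_vector))


# Assumed application vector: cost of each application measured in isolation
# (hypothetical units: seconds per request), e.g. from microbenchmarks.
application_vector = [0.5, 2.0, 4.0]

# Assumed system vectors: number of requests per application in a workload.
current_workload = [10, 5, 1]
anticipated_workload = [20, 5, 2]

current_cost = predicted_performance(current_workload, application_vector)
anticipated_cost = predicted_performance(anticipated_workload, application_vector)

print(current_cost, anticipated_cost)
```

Under these assumptions, evaluating a different existing or anticipated workload only requires swapping the system vector; the application vector is reused unchanged.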
The floor was then opened for questions. Dave Patterson directed the first question to Seltzer, asking whether the microbenchmarks in her methodology had proved useful in capacity planning. She replied that this was indeed the case: they had obtained stunningly good results for Web servers by recording their behavior for 3 months and then using this data to predict 12 months into the future. She acknowledged, however, that she did not yet have such good results for OS benchmarks. Patterson then observed that the OS community should start working on problems other than performance, and that this would require leadership. Mogul concurred, but observed that there was often a price premium for performance, and that vendors often had to demonstrate improved performance to counter the propaganda of competitors. Patterson repeated his plea for the OS community to broaden its vision, and pointed to system reliability as an area crying out for quantitative study and characterization.
A different line of questioning was initiated by Godmar Back, who asked whether Seltzer's methodology applied to rapidly evolving systems. He observed that the academic research cycle is typically 2-3 years long, and that benchmarking done on an early version of the system rarely reflects the performance of a later version. This triggered an extended debate involving Back, Seltzer, and Karin Petersen. Seltzer's position was that program committees of conferences should be encouraged to accept papers that revisit performance measurements reported earlier. Petersen observed that this alone would not be adequate: authors should be required to clearly identify the causes of performance changes in a way that would give her confidence that their tweaks would work in her system. Seltzer countered that the use of a system vector in her methodology served just that goal: by comparing system vectors, one can determine whether the workload reported in a publication matches another system's workload closely enough to expect portability of results.
Kim Keeton then shifted attention to industry standard benchmarks. She observed that the TPC benchmarks for transaction processing were hard to use. They are complex, require a large amount of hardware, and involve numerous configuration and scaling parameters. In addition, their use on specific systems is typically conditional upon limited disclosure of results, a restriction known as the "DeWitt clause". Mogul concurred with Keeton, and suggested that the 189 parameters in the TPC benchmark indicated a strong need for self-tuning benchmarks. In defense of TPC, however, he observed that comparisons based on it were a reasonable reflection of real-world performance.
The discussion then shifted to the evolution of benchmarks. Satya observed that once a benchmark becomes widely used, benchmark-resistant strains of systems evolve. This, he said, is very much like the dialectic between viruses and immune systems in biology; perhaps we should borrow a page from that domain and randomly mutate our benchmarks from time to time. Seltzer responded that her application-specific benchmarking methodology was well suited to this: one simply plugs in a different system vector to reflect a changed benchmark. Satya expressed concern that the methodology seemed brittle in a different way: if an application changes, the old application vector may no longer hold. To this, Seltzer replied that you receive an application vector with each new release of that application. Regehr remarked that benchmark mutation requires an ongoing investment of time to stay ahead, to which Satya replied that one should accept this as part of the normal cost of developing a benchmark. This requires us to be proactive, and to think like a virus, not the immune system. Mogul expressed skepticism about keeping benchmark mutations secret, observing that security through obscurity has never worked. Satya disagreed, drawing an analogy with examinations, where last year's version is available ahead of time but this year's is a secret until it is given. Seltzer closed this line of discussion by observing that mutating the benchmark would not, in any event, have helped in the specific example that Regehr had described.
The panel closed with a question from Dickon Reed about the size of a typical benchmark in Seltzer's methodology. Her reply was that it was more important to focus on identifying a system vector that was representative and finite than to worry about its size.