Many sleepy eyes and some yawns greeted Doug Terry, as he chaired the first session on Tuesday morning. However, these soon gave way to interest and attention, as three exciting papers on unusual topics were presented.
The first paper, entitled ``The Case for Higher-Level Power Management'' by Carla Ellis, described why operating systems should treat energy as a first-class resource. An appealing aspect of this presentation was that its motivation was based on first-hand experience. In trying to build an application called a Hiker's Buddy using a PalmPilot, the author had come to grips for the first time with trying to manage energy in a real-life situation. One of the slides in the talk, showing a stunning photograph of the author on a hike in the Olympic Peninsula with her Hiker's Buddy, won sustained applause from the audience.
Mitchell Tsai began the question session by observing that Alexey Rudenko, in his group, was working on power management but faced multiple problems. First, he needed a method for devices to wake up the computer under certain conditions, but this wasn't possible on today's hardware. Second, battery life depends on how the battery is drained, and this is not a simple linear function. Ellis agreed that new architectural features were needed to support higher-level energy management. Tsai followed up with the observation that if total energy consumption could be kept below that obtainable from solar energy, the whole problem would go away. Ellis responded, to much laughter from the audience, that she was in the Pacific Northwest where solar energy was not a viable option.
Margo Seltzer complimented the speaker on the work, and asked why the GPS needed to talk to the Pilot for the entire three minutes, while it was trying to stabilize. Ellis replied that the duration was not always three minutes, and that the requirement was driven by the dumb design of the GPS unit. Milan Milenkovic also commended the speaker for the fine work, and observed that we need better APIs for controlling OS power consumption. He cited the Intel/Microsoft ACPI specs, and wondered how much of this information was available to applications. Ellis replied that she would have to check on this.
Peter Chen presented the second paper, entitled ``Reliability Hierarchies''. This paper argues that system designers should treat reliability as a property that induces a hierarchy of levels in a system, much as access time induces a memory hierarchy. This viewpoint gives the designer a conceptual tool for balancing performance and reliability. Chen elaborated on these ideas using the Rio system as a case study.
The first question, from Dave Patterson, was whether the server at Michigan from which statistics were reported in the paper used RAID. Chen replied that it did not, and the disks were just a bunch of ordinary disks. In response to Patterson's followup question on how RAID would have affected the reliability numbers presented, Chen replied that it would have improved the media failure figures, but these account for only a small part of all failures. RAID would have done nothing for software crashes.
Mary Baker then triggered an extended dialog with the speaker by observing that the use of a special sync call in Rio, to get data out to disk just before a crash, is a weakness in the system. In her experience, such special features are difficult to test and debug, and hence hard to trust. Chen countered that the mechanism had been tested in thousands of crashes. Unconvinced, Baker pressed the point by observing that these crashes were deliberately induced and that real-life crashes may be messier. Chen held his ground, only conceding that more extensive testing is always useful.
Satya asked whether the server data presented in the paper was from live use of the system, or from stress tests. In particular, he wanted to know if the speaker's email was stored on the server. Chen replied that the data was not from a system in live use.
Referring to the methodology for inducing crashes, Karin Petersen asked how one could reliably capture control at the start of a crash, and whether the artificial crashes used in Rio accurately reflect how real crashes happen. For example, her NT box freezes once a day, requiring a hard reset. She couldn't see how a system like Rio could capture control in such cases. Chen replied that the current version of Rio does not exhibit freezes, though earlier versions did. He agreed, however, that nothing is perfect and that there were undoubtedly some kinds of crashes which the system would not be able to handle. In response to a followup question, Chen said that 3% of the crashes lost data stored on disk, but only 2% of them lost data stored in Rio. A related question from Jeff Chase asked how often the special crash-inducing key sequence had to be used; Chen replied that he did not have this information since he had not personally conducted the crash testing.
Persisting in Petersen's line of questioning, Ian Pratt observed that many failures on their PCs caused the hardware to lock so solidly that even the power switch was ineffective since it was only an interrupt to the BIOS. Chen replied that the Rio approach assumed that interrupt masking was not happening at the hardware level, but at the software level. If this assumption isn't true, Rio can't handle the resulting crashes. Jeff Chase added that such hard failures are typically due to buggy hardware.
Peter Chubb observed that a common failure mode of NFS was to fill a buffer with nulls rather than data; flushing that buffer to disk is the worst Rio could do. Chen observed that this was a file system failure, and that Rio could hardly be held responsible. Chubb followed up by observing that Chen was effectively describing a checkpointing approach. Chen agreed with this characterization, and added that deciding on the commit point is the hard part. Rio treats the start of a crash as the commit point; particular applications may have other notions of where the commit point should be.
A shift in perspective was offered by Bill Tetzlaff, who noted that 50 years of tape data were unreadable and hence lost, because they were recorded using obsolete hardware or software. Chen responded that in his methodology, loss rate metrics treated all data as equally important. In practice, of course, that is not right -- very old data might be unimportant. One could come up with a whole family of metrics to address this limitation.
The final presentation in the session was by Mitchell Tsai, whose paper ``Command Management System for Next-Generation User Input'' described his experiences in using speech recognition for application control. The central lesson he reported was the importance of striking the right balance between centralizing speech functionality as an OS component and exploiting application-specific knowledge in handling ambiguity and other language-related problems. A successful design does not have a clean layering, but is rather messy. Tsai also identified a global undo capability as an important OS feature.
Tom Kroeger began the question session by asking how one could avoid recursion in undo. Tsai replied that this was indeed a problem, but that it could be alleviated if the system recognized that it was already in an undo context. To Kroeger's followup question on whether he had used speech recognition systems other than Microsoft's, Tsai replied that he had used Dragon Dictate and many others, but depended on Microsoft for the PowerPoint application and the speech system development kit.
Mike Jones suggested that the reason for needing pervasive undo was to avoid requiring total accuracy in recognition. Tsai responded that an alternative approach would be to structure all major actions in an application as a two-phase commit, in which the user is given an indication of how the application has interpreted his spoken commands before execution becomes irreversible. Jones followed up by asking if there was such a thing as an undo barrier, and if so, when one would use it. Tsai's reply was that this would depend on the application, and that it wasn't particularly useful with PowerPoint.
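The two-phase idea raised in this exchange can be illustrated with a minimal sketch. It is not taken from Tsai's system: the names `PendingCommand` and `CommandManager`, and the slide-deletion example, are hypothetical and chosen only for illustration. The sketch assumes a recognizer that hands over an utterance along with its interpretation, and it defers the action until the user confirms.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PendingCommand:
    """A recognized command awaiting confirmation (hypothetical type)."""
    utterance: str               # what the recognizer believes it heard
    description: str             # the interpretation shown to the user
    action: Callable[[], None]   # the effect to apply once confirmed


@dataclass
class CommandManager:
    """Hypothetical dispatcher: phase one proposes, phase two executes."""
    pending: Optional[PendingCommand] = None
    log: List[str] = field(default_factory=list)

    def propose(self, cmd: PendingCommand) -> str:
        # Phase one: record the command and report the interpretation.
        self.pending = cmd
        return f"Heard '{cmd.utterance}': {cmd.description}. Confirm?"

    def confirm(self) -> bool:
        # Phase two: run the action only after explicit confirmation.
        if self.pending is None:
            return False
        self.pending.action()
        self.log.append(self.pending.description)
        self.pending = None
        return True

    def cancel(self) -> bool:
        # A misrecognized command is dropped before it has any effect.
        had_pending = self.pending is not None
        self.pending = None
        return had_pending


# Usage: a misheard command is cancelled, then a correct one is confirmed.
slides = ["intro", "results"]
mgr = CommandManager()

mgr.propose(PendingCommand("delete slide one", "delete slide 1",
                           lambda: slides.pop(0)))
mgr.cancel()   # the user rejects the interpretation; slides are untouched

mgr.propose(PendingCommand("delete slide two", "delete slide 2",
                           lambda: slides.pop(1)))
mgr.confirm()  # the user accepts; the action now takes effect
```

In this sketch a recognition error costs one extra confirmation step rather than an undo, which is the trade-off the exchange between Jones and Tsai turns on.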
Milan Milenkovic asked for clarification on a point made during the talk, about accuracy improvement through dynamic restriction of command choices. Tsai replied that one could do that in a command and control system through a restricted grammar.
At the close of the session, Satya asked whether applications like Web browsing had been considered, and whether such read-only applications were more tolerant of speech recognition errors. Tsai replied that he had not tried that, though he was aware of other researchers who had. He observed that Microsoft's Internet Explorer already offers much better hooks for speech command and control than other browsers and Microsoft Office.