End-to-End arguments in System Design ------------------------------------- J H Saltzer, D P Reed, D D Clark This paper argues against putting excess functionality in the lower layers of a layered system. The example taken is that of a communication system and it is shown through many examples that it almost never pays to just implement a particular function entirely inside the network because the end-points need to implement it anyway for correctness. Implementing a certain function at multiple levels usually leads to lower efficiency particularly for applications that don't need this extra functionality. Of course sometimes a incomplete version of this functionality is required as an optimization at the lower level. Note that these optimizations may also be put into the application, but this approach may be more expensive because each application will have to implement this functionality. An Analysis of TCP Processing Overhead -------------------------------------- David Clark, Van Jacobson, John Romkey, Howard Salwen This paper tries to get an handle on what really is expensive about protocol processing. It attempts to "evaluate" the theory that protocol processing especially TCP processing is expensive. The authors first analyse the costs of TCP processing. To do this with some fairness to TCP they first reimplement TCP in userspace without mbufs-- (this takes care of factoring out some of the overhead due to OS specifics i.e. buffering and interrupt handling). Next they program a "fast-path" TCP. They count instructions for input processing on both the sender and receiver sides, and also for the output processing (for which they do not have a "fast-path"). Costs of checksumming and PCB lookup are ignored for now. Next they discuss the costs of buffering, timers, checksumming and PCB lookups. With this, they come up with a result that TCP can handle fairly high data rates... this is not "really" true and the reason is the OS overhead etc. Results 1: TCP puts most of its complexity in the sending side, not in the output processing for the data but in the input processing for the control packets. 2: IP costs very little 3: Timers and schedulers should be designed and implemented properly (these guys used Varghese's timers) 4: PCB lookup with the cache can be pretty good 5: Protocols are slow because of OS overhead, I/O device overhead, timers and synchronization, and buffering. Also, operations that touch the bytes are expensive and will become more expensive. Use ILP...? 6: Moving protocol processing to outboard processors or into silicon should not be taken so lightly. Moving to outboard processors would mean that the buffer layer will become more complicated and might actually become very expensive. A silicon protocol designer should keep in mind that they need to interface with the OS too and that is really the expensive part of network processing. A big disadvantage of moving to silicon is inflexibility-- we are still playing with TCP's congestion control algorithms. The Emerging Gigabit Environment and the Role of Local ATM ---------------------------------------------------------- J. Bryan Lyles and Daniel C. Swinehart Network speeds and capacities increase incrementally in response to incremental increases in network traffic. Gigabit n/w is a revolutionary development which will enable applications that could not be supported by incremental improvements to today's n/ws. The original focus was on applications that require Gb/s capacities individually. In contrast, the authors propose that moderate applications an order of magnitude slower individually require aggregate bandwidth in the order of gigabits. In this paper the authors consider the b/w requirements of office applications like electronic copying, live and taped meetings and seminars. They examine trends in lossy compression technology that have decreased the b/w requirements of various "office applications" to the point where the nos. are tractable. Next, the authors examine the role of ATM in making Gigabit n/w'ing a reality. They discuss the "top-down" approach of the telecom carriers - i.e. to introduce ATM in the wide area in support of existing services and gradually indroduce end-to-end enhanced services. The authors propose a complimentary bottom-up approach to set up local ATM network-islands and use these as testbeds for the installation, use and perfection of these new applications. They believe that this will spur the ATMizing of the n/w. The authors next discuss the various challenges in making LATM successful in terms of the stringent cost, size, power and maintainability requirements of LATM. Complexity of terminal equipment (e.g. monitors with ATM plug-ins) is an issue. Multicast in ATM switches, the idea of a DAN, interfacing with other n/ws, addressing and security are other issues. The hypothesis is that LATM switches operate in a different environment and with different requirements than telephone company switches and therefore should be cheaper. The Design and Implementation of a Log-Structured File System ------------------------------------------------------------- Mendel Rosenblum and John Ousterhout This paper says that the FFS is slow and only able to use 10% of the disk's raw bandwidth. Also, with memories getting larger and processors getting faster this is going to hit disk bound jobs more and more. These guys propose a log-based filesystem where everything is written on the log. Reads are not slow because they use index structures similar to those of the FFS. Using a log also simplifies recovery. The major design issue is how to handle fragmentation and how to do cleaning. The two issues they look at are what segments to clean and how to lay out the data got out from cleaning. These guys look at a simple cleaning strategy (using simulation) which simulates a simple form of locality (hot-and-cold) and a single cleaning threshold based on segment utilization and they show that things get worse with increasing locality. They examine the reasons for this and conclude that this is because using a single threshold based on utilization is bad because clean blocks from cold segments are more "valuable" than those from hot segments. They then use a cleaning strategy based on cleaning segments based on benefit/cost where benefit is the product of age of blocks freed times no. of blocks freed. This strategy performs well and leads to a bimodal distribution of data into hot and cold segments. The authors also present some nos. based on a comparision of Sprite LFS vs SunOS 4.0 (i.e. the FFS). Sprite performs much better than SunOS except for one case - i.e. sequential read after random writes. LFS brings about a different form of locality where data accessed around the same time is kept "close" on disk. FFS on the other hand always optimizes for sequential reads and therefore incurs a higher write cost by trying to lay things out on disk well. FFS stuff --------- from the FFS paper, the 4.4BSD book and the Ganger/Kaashoek Usenix '97 paper The FFS's basic aim was to improve over the original UFS. The techniques employed were to use larger blocks (with allocation at the fragment level), and better layout strategies to prevent seeks and rotational latency wherever possible. Later, clustering was introduced in SCO Unix, SunOS and 4.4BSD to get better than half the raw b/w of the disk by doing large writes and read aheads even with controllers that did no track-caching. A few changes were also made to the interface to allow support for locking and for atomic renames. Ganger and Kaashoek introduced some changes for supporting small files even better. The FFS allocates at the fragment level and gets around the grow-one-fragment- at-a-time problem by letting the applications know that the best write size is a block. Layout in FFS is divided into higher level policies which use partial information and heuristics to allocate directories and file blocks to cylinder groups. Later, lower level policies with exact information are used to allocate "rotationally-close" blocks. Clustering is used to break the half raw b/w barrier by writing and reading larger chunks of data. FFS supports advisory locking and not hard locking to avoid interefering with the root- override-all-protection philosophy of Unix. The G/K paper contends that modern filesystems have bad performance for small files. LFS writes well but does not read very well because somehow reads are not always serviced from the cache (as was expected in the LFS paper)- Mary Baker's paper says this is because with larger memories the file sizes being accessed by programs are also growing. FFS lays out files well but things are usually localized to the same cylinder group rather than the same cylinder (and seeks may happen because of requests between consecutive accesses to the same file anyway). Also, seek costs are only half the fixed costs with things like rotational latency, command processing and data movements comprising the other half. These guys propose C-FFS which embeds inodes into the directory entries for small files and explicitly co-locates blocks allocated to files in the same directory. Architectural Considerations for a New Generation of Protocols -------------------------------------------------------------- David D Clark and David L Tennenhouse This paper starts with the concern that existing protocols will not be able to handle the requirements of tomorrows networks. There is a concern that today's protocols will not be able to handle gigabit operation, may not work well over the emerging network technologies and do not support the different service requirements of multimedia and real-time applications. The rest of the paper examines the key principles for the elaboration of a protocol architecture for tomorrow. The key things are: 1) It is important to distinguish between the architecture of a protocol suite and the engineering of a specific system. In particular, layering may not be the most effective modularity for implementation. Therefore it is important for the structure of an architecture to not impose unnecessary constraints that affect the engineering of the systems [early demux paper]. 2) Control manipulations in current protocols cost far less than data manipulations. Operations that touch all the data such as copying, checksumming and presentation conversion limit the throughput of today's protocols. A key aspect of presentation conversion is that it needs to be done in the context of the application. So, the application and not the kernel/outboard processor is the usual bottleneck in network throughput. So, if an application cannot run whenever data arrives from the network, it will fall behind and since it is the bottleneck it will never catch up. A design goal should therefore be to allow the application to run whenver data comes in. 3) In current protocol designs, data loss and reordering prevent real-time presentation conversion. A general purpose protocol ought to permit any of following to handle lost data: buffering by the sender transport, recomputation by the sending application, or proceeding without retransmission. The authors next introduce Application Level Framing. An application breaks up data to be communicated into ADUs- too big means potential waste of bw, too small means too much overhead and may be fundamentally limited by the context. The general principle behind presentation conversion is that the sender must perform enough of it for the receiver to be able to compute for each ADU where it is to go. The authors also introduce ILP. Traditional protocols often do not permit this, but it is important to get good performance on modern RISC architectures where memory is very far from the processor. The key argument in favor of ILP is that an integrated processing loop is more efficient than several separate steps which read data from memory, possibly convert it and then write it back again. Finally, the authors say that although the layered approach has some merit at the network layer and below, this is not true at the transport and higher layers where end-to-end symbols are exchanged. They propose that thes layers can be structured in a less constrained manner. A Case for Redundant Arrays of Inexpensive Disks (RAID) ------------------------------------------------------- David A Patterson, Garth A Gibson, and Randy H Katz This paper says that CPU and memory performance are increasing. I/O is not speeding up nearly as fast and so we have a pending crisis. A solution that may work is to use arrays of disks. Using arrays is slightly problematic because the mean time to failure of an array is 1/nth that of a single disk. Using RAIDs is maybe a possible solution. Several RAID "levels" are proposed and evaluated. RAID level 1 is just mirrored disks. Average seek is lower. Cost is double. RAID level 2 using bit-interleaving with hamming codes for ECC. Some disks are dedicated for as check disks. Large I/Os get full bandwidth of all disks. Small I/Os still require accessing all the disks and so perform dismally. Cost is not as high as for level 1 but may still be substantial. Appropriate for supercomputers but bad for TP systems. Level 3 RAID uses the observation that the disk controller can typically figure out which disk has failed so we do not need as many check disks. This reduces the overhead. RAID level 4 uses block-interleaved parity. Small write requests require 4 I/Os. Since there is only one parity disk, it can become a bottleneck. Level 5 RAID distributes this parity information onto multiple disks and thus allow multiple concurrent writes per group. This paper also mentions a lot of issues about RAID that need to be research on. RAID: High-Performance, Reliable Secondary Storage -------------------------------------------------- Peter M Chen, Edward K Lee, Garth A Gibson, Randy H Katz, David A Patterson The Design Philosophy of the DARPA Internet Protocols ----------------------------------------------------- David D. Clark This paper tries to capture the original objectives of the Internet protocols and how the affected its form. The objectives were (in order of importance): 1) To connect existing nets 2) Survivability 3) Multiple types of service 4) Support for varieties of nets 5) Distributed management 6) Cost effectiveness 7) Low attachment cost 8) Resource accountability The rest of the paper describes the influence of each of these in the design of the Internet Protocols. 1) => packet switching since nets to be connected were p s nets. Again store & forward was used because it was well understood in the context of the ARPANET 2) => datagram and no state in net. all state at end-host. 3) => IP and TCP were split. IP provides basic datagram service over which protcols with various types of service could be built. Sometimes this is hard to do because some nets (like X.25) give reliable delivery (which means uncontrollable delay) and this cannot be turned off. 4) => Minimum set of assumptions about what net should support. Datagrams/ addressing/reasonable reliability. Does not expect reliability/broadcast/ multicast/sequenced delivery etc. If these had been required, all nets/network drivers would have to implement them => a lot more effort than now. 5) Other goals not met so well. Distributed control- yes and no- no because of the restricted way in which routing tables have to be managed. End-to-end reliability means in the face of high errors=> more expensive. High cost of attachment because end-host has to have TCP/IP software. Poor host implementations can hurt the net. Accountability zero. Datagram- not so good for accountability. Need "flows"? The UNIX Time-Sharing System ---------------------------- Dennis Ritchie and Ken Thompson Pilot: An Operating System for a Personal Computer -------------------------------------------------- Redell et. al. Pilot is a system with a different initial set of assumptions and goals from those of most time-sharing systems. It is a single-user, single-language computer with only limited features for protection and resource allocation. Assumes errors are a more serious problem than maliciousness and resource allocation is not oriented towards fair distribution of scarce resources. An extended run-time system for Mesa. Protection based on Mesa's type-checking. 1) 64-bit file uids are both defensive capabilities for files and also each file has a unique time and space uid. This is because comm. between systems is expected to take place through removable volumes and because the system does not implement character string naming. Direct support for scavenging operation. Immutable files for supporting linking of files using same uid on multiple machines (copying normally mutates uids). Possible to have large volume that spans multiple physical volumes and visa-versa. 2) Single address space consists of runs of pages called spaces. Spaces mapped directly to files. Spaces unit of replacement and not pages. Also have Space.activate/deactivate functions to help in providing specialized replacement strategies. Hints, improve performance and are viable as this is a single user system. No separate read/write interface supported for 4 reasons- decoupling buffer allocation with disk scheduling, automatic protection, advice taking (space.kill) helps to reduce unecessary I/Os in this scheme, r/w can be implemented in terms of mmap, not otherwise efficiently enough. Synchronous write also provided. 3) Streams and I/O devices 4) Loopback optimization, sockets (UDP like service), NetworkStream on top of sockets - like TCP, listen, bind, accept etc. One difference- close assumes that no further traffic will happen. Clearinghouse like the YP map. 5) Manager/kernel approach in implementation. 6) Explicit support for scavenger- self-identifying pages for robustness and auxillary maps for efficiency. Support for higher-level integrity checking during scavenging. Distributed File Systems: Concepts and Examples ----------------------------------------------- Eliezer Levy and Abraham Silberschatz Naming and transparency- location transparency and independence, NFS is transparent but not independent because when you move a file, its pathname changes. AFS is independent. Extra level of indirection. Naming scheme tradeoff- does server do lookup search or does client? Semantics of sharing- Unix semantics, strong consistency+sharing file pointer. Session semantics- consistency enforced across machine boundaries only on close. Immutable file semantics. Transaction semantics. Remote access methods- remote service or caching. inefficiency vs. cache consistency problem. cache consistency and statefulness of servers vs fault tolerance. Fault tolerance- statefulness of servers. availability and hints. replication and consistency. Scalability- symmetric functions, avoid bottlenecks, lwps Unix United: network transparency with out naming transparency, unmodified kernel, distributed calls intercepted above the top half, .. trap, forwarded lookups Locus: block caching similar to Unix with tokens for consistency, complete unix semantics including pointer sharing, primary copy must be available for writing, other copies may be older but not partially modified, replicated mount table and CSS are bad for scalability, hard recovery because of state, fast because of kernel implementation. NFS: Remote service mixed with caching, timing dependent sharing semantics, remote lookups with lookup cache, complete stateless service, NQNFS- some soft state with leases. Sprite: Based on aggressive caching, remote service for consistency when writing, prefix tables for lookup, state at server for consistency, Unix semantics, poor fault tolerance- questionable scalability because of broadcast. Andrew: Whole file caching, session semantics, Venus, Vice, clustering, client lookup, replicated location servers, authentication and encryption, access-list for protection, fault-tolerance issues... Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale ---------------------------------------------------------------------- Multiprocessors --------------- Edward Rothberg, Jaswinder Pal Singh, Anoop Gupta This paper presents an application-driven study of the issues relevant in determining the distribution ofresources among the processor, cache and memories of large scale multiprocessors. 5 application classes were looked at. The conclusion is that a moderate-sized cache is good enough for and either does not scale with problem/machine size or scales very slowly. Also, all but one problems scale well with decreasing grain-size. For the FFT things are bad as far as grain-size go. There are some compelling engineering reasons against building very fine-grained machines namely multi-chip memories for high memory b/w, programming model that makes remote accesses expensive, large fixed costs of communicating and relative costs of processors and memory. Complexity/Performance Tradeoffs with Non-Blocking Loads -------------------------------------------------------- Keith I Farkas and Norman P Jouppi In this paper these guys study the effect on performance of the SPEC benchmarks by implementing non-blocking loads. They show that non-blocking loads can decrease memory CPI by factors of 2 for CPU integer benchmarks and a factor of 4-10 for numeric benchmarks. It is worthwhile supporting multiple in-flight misses for numeric benchmarks. It is also worthwhile supporting multiple misses to the same set with different tags for direct-mapped caches. Similar behaviour is observed for larger caches. Cache line size decides whether it pays more to support multiple primary misses or multiple secondary misses. It is also important to schedule for misses than for hits. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors ------------------------------------------------------------------------ John Mellor-Crummey and Michael Scott The key thing that this paper says is that efficient locks and barriers can be implemented in s/w. The key thing for locks to avoid network traffic while spinning is to spin on a private copy of a certain variable which is signalled on a wakeup. This private copy can either come about by cache-coherence or "replicated" in s/w. In the latter case a wakeup wakes up only one processor. The remaining problems are related to avoiding false sharing, appropriate statically determined layouts for the replicated spin variables to avoid having to access remote locations, and space requirements per lock. For barriers, the key thing again is to spin on local variables while waiting and to most efficiently "collect" and "dessiminate" arrivals and the wakeup. Sense-reversal is also an issue for correctness. Again we want the location of the shared-variables in collection and wakeup trees to be determined at run-time. The main argument then is against "dance-hall" architectures. A Cache Coherence Scheme with Fast Selective Invalidation --------------------------------------------------------- Hoichi Cheong and Alexander Veidenbaum In this paper these folks propose a new software-based cache coherence mechanism. It depends on 1) compile-time insertion of invalidate-cache instructions, 2) compile-time marking of read-references, 3) hardware support for "cache-read" and "memory-read" references and 4) reference-time checking and invalidation of individual cache entries. This approach invalidates only that data that potentially stale and leaves data "known" to be current alone. It is better than indiscriminate invalidation since it lowers miss rates. It is faster than selective-invalidation. It is better than other reference marking schemes since marking something does not in itself imply invalidation- some run-time information is used to avoid missing on data that is really upto-date. In particular, when a stale location is updated, it can then be cache-read till the next invalidate. On the Duality of Operating System Structures --------------------------------------------- Hugh C Lauer and Roger M Needham In this paper the authors present that process-based OSes are duals of procedure-based OSes. In particular a performance conserving mapping exists to take a program from one kind of system to the other. Scheduler Activations: Effective Kernel Support for the User-Level Management ----------------------------------------------------------------------------- of Parallelism -------------- Tom Anderson, Brian Bershad, Ed Lazowska, Henry Levy This paper says that kernel level threads are inefficient and user-level threads suffer from poor system-integration. The propose scheduler-activations which are based on the following ideas: 1) Processor allocation is done by the kernel 2) Thread scheduling is done by the user-level 3) The kernel notifies an address-space scheduler about every event that affects it. 4) The user-level informs the kernel about the subset of events that can effect processor-allocation decisions. This is implemented using kernel-thread like things that go up and down which are called scheduler activations.