This chapter examines three broad categories of related work. We overview various scheduling algorithms, then move on to techniques that have been proposed to improve disk request handling in the I/O subsystem (external to the scheduler). Finally, we look at scheduling concepts relevant to other domains like CPU scheduling and rate-based flow control.
The last thirty years have seen an enormous amount of research in the area of disk scheduling algorithms [SCO90,WGP94,BBG$^+$99,HC00,JW91]. The core objective has been to develop scheduling algorithms suited for certain goals, sometimes with provable properties.
Seek reduction is often crucial for extracting even acceptable throughput for disk intensive applications. The Shortest Positioning-Time First (SPTF) policy greedily aims to improve disk throughput by always scheduling the available request that has the minimum head repositioning overhead time [WGP94]. This policy has two drawbacks: firstly, this policy can incur huge maximum response times due to starvation of old requests. Several variants have been proposed to address this problem. For example, Grouped Shortest Positioning-Time First (GSPTF) divides the disk into cylinder groups, and applies SPTF in each group before moving on to the next [SCO90]. Aged Shortest Positioning-Time First (ASPTF) applies an aging function to the times computed in SPTF. When a request remains pending for too long, it automatically becomes more likely to get serviced. There are many aging strategies, corresponding to characteristics of the response time bound. ASPTF generally demonstrates remarkable performance, both in terms of high throughput and bounded maximum response time [WGP94,JW91,SCO90].
The second problem with the above algorithms is their requirement for intricate knowledge of the disk layout, in order to predict future head positions and access times accurately. An appreciable component of access time is accountable to rotational latencies, and these are tricky and CPU-intensive to estimate [JW91,HC00]. A popular disk-independent simplification is to schedule requests based on logical block number (LBN) differences, used as an approximation to head repositioning time. Though there can be significant performance loss (up to 18% in practice) in this approach, the remarkable ease and generality of implementation often warrants its deployment.
An example of these LBN-based scheduling policies is Shortest Seek First (SSF), which provides high throughput, but with possible starvation. A popular and widely deployed variant in UNIX-based systems is Circular Look (C-LOOK), also known as the Elevator algorithm. The head traverses the disk in only one direction, servicing all the available requests on its way. On reaching the last such request, the head moves quickly back to the first available request, and starts over. This algorithm is equally fair towards requests directed at various locations on the disk. It avoids starvation, but can incur huge response times (even minutes) if a process several issues many requests and forces a large portion of the disk to be read [SCO90]. Yet this policy provides fairly good throughput and usually low response time, and is implemented in FreeBSD and Linux.
Quality of service schedulers are increasingly gaining prominence in modern systems. Accurate proportional-share schedulers are required for various high-level quality of service systems, like using reservation domains to isolate co-hosted websites [BGÖS98], and performing admission control to guarantee predictable performance of webservers [AID01]. We examine a brief taxonomy of proportional-share scheduling algorithms.
A proportional-share scheduler is expected to allocate resources between resource principals, in proportion to some assigned shares. Generalized processor sharing (GPS) [PG93] is an idealized ``fluid'' model that theoretically achieves this goal at all times. However, requests are not infinitesimally divisible, and service can be provided only to one principal at a time. These two intrinsic inexactnesses drive the need for approximations like packet-by-packet GPS (PGPS) [PG93], Weighted fair queueing (WFQ) [DKS89] and VirtualClock [Zha91] in the network packet scheduling domain. The first two are based on a notion of global virtual time, whereas VirtualClock considers per-stream time (and has the problem of an inactive stream later monopolizing link usage [PG93]).
Similar proportional-share CPU scheduling policies have been proposed, with subtle differences. Start-time fair queueing (SFQ) [GGV96] is an application of WFQ to a hierarchical multimedia scheduler. The Stride scheduler [Wal95,WW95] uses methods similar to VirtualClock, and uses the flexible ticket abstraction to manage allocations. Move-to-rear list scheduling (MTR-LS) [BGÖS97] simultaneously guarantees cumulative service, delay bound and fairness. (Cumulative service is shown to be more meaningful on a realistic system with multiple resources to manage. In this context, [BBG$^+$99] mentions how managing each resource in isolation is not realistic due to ``closed-loop'' behaviour of most applications; this slightly relates to deceptive idleness caused by resource contention, as described in section 3.2). Yet-another fair queueing (YFQ) [BBG$^+$99] is a disk scheduling variant to WFQ, and performs novel disk-specific optimizations like batching and overlapping.
Lottery scheduling [WW94] is a randomized algorithm that allocates resources to principals on a coarse timescale. A unit of resource is assigned to a principal with probability proportional to its assigned share.
Though scheduling policies have been analyzed in great detail, relatively less attention has gone into examining the interface between the scheduler and the rest of the operating system. For example, Seltzer et al. [SCO90] perform a detail simulation study of various seek reducing schedulers, under the premise that a busy system generally has long and bursty disk queues. Our work realizes that these requests may not be of the right kind, due to the synchronous manner in which applications generate them. Deceptive idleness thus originates at this interface, and is closely related to other techniques implemented in the I/O subsystem.
For write requests, IRIX implements I/O pacing to prevent programs from saturating the system's I/O facilities. It enforces per-file high and low water marks on the number of queued requests. The low water mark here is to buffer write requests and increase opportunities for seek optimization; this can be viewed as a logical counterpart of NWCS for write requests.
Also in the context of efficiently handling asynchronous requests, freeblock scheduling [LSG$^+$00] has been proposed to increase media bandwidth utilization. It dictates head movement paths so as to service asynchronous requests approximately enroute to the synchronous ones. This system bears a logical similarity to NWCS; both improve the scheduling policy by making use of information about request characteristics, acquired at the scheduler-kernel interface.
Like deceptive idleness, a different kind of scheduler starvation has been noted by [JW91] in the context of the Aged-SPTF scheduler. The condition arises if aging requests are prioritized abruptly at some age threshold, and if the rate of incoming requests exceeds service rate. Every scheduler choice becomes forced due to prioritization rather than SPTF, and the scheduler degenerates to FCFS. This is a more obvious phenomenon than deceptive idleness, and the solution clearly involves smooth aging.
Our NWCS solution adaptively determines application behaviour with respect to synchronous request issue. Understanding application behaviour in various aspects of the disk subsystem can yield significant benefits, and such methods are gaining prominence. For example, adaptive block rearrangement [AS93] co-locates frequently referenced blocks, thus improving seek performance.
Filesystem prefetching [SSS99] is a well-researched area, and for moderately regular workloads, asynchronous prefetch can hope to transparently eliminate deceptive idleness and bring about improved performance. To improve the feasibility of prefetch, techniques such as transparent hints by speculative execution [PGG$^+$95] have been proposed. In comparison, our NWCS approach generally achieves slightly lower performance than prefetched I/O, but is feasible in more situations due to weaker demands on predictive power.
We take a brief look at CPU and network interface scheduling, to identify some interesting phenomena related to deceptive idleness.
CPU scheduling is preemptible, so there is no analog of deceptive idleness in this domain. There is, however, the equivalent of seek optimization by scheduling requests from the same process. Namely, Ousterhout has proposed gang scheduling to reduce context switching overhead, by scheduling groups of threads together [Ous82]. Various studies of cache affinity on processor scheduling have also been conducted [VZ91].
On a different note, non-work-conserving CPU schedulers of various types have been clearly established as a favourable mechanism for handling bursty workloads. Differentiated, prioritized levels of service in a webserver can be provided by scheduling incoming requests from within the kernel. Even when a CPU is idle, not servicing a low-priority request can prevent it from choking higher-priority requests that arrive soon after. We can enforce stricter priorities under bursty load. [ADMC98].
A CPU scheduler can be non-work-conserving in a different way, by keeping a few CPUs idle for handling unexpected and bursty workloads. A detailed simulation-driven analysis [RSSD95] has shown its effectiveness for non-scalable workloads, possibly with high variance in arrival rate and execution time.
Rate-based flow control by network packet scheduling has traditionally paced the way for concepts like class-based queueing. This domain is non-preemptible, but there is no real reason to expect deceptive idleness. The high bandwidth-delay product forces applications to always maintain windows of outstanding packets, due to which the scheduler never starves. There is no equivalent of context-switching overhead, so a scheduler has no reason to batch packets.
Interestingly, there is an opposite effect here. Consider
WFQ [DKS89] scheduling between say 11 active queues, one of
which has a share of 10 and the others have a share of 1 each. Then WFQ will
typically schedule ten requests from the first queue, followed by one
request from each of the other queues. This leads to periodic bursts of
packets, which is undesirable for network congestion control etc. To remedy
this, Worst-case Fair Weighed Fair Queueing (WF
Q) is a work-conserving
scheduler that interleaves packets even more than WFQ does [BZ96].
This can be regarded as an optimization in exactly the opposite direction to
NWCS, which aggregates disk requests for seek optimization.
Non-work-conserving packet schedulers were originally unpopular because of suboptimal average performance. However, integrated services networks require end-to-end delay bounds, which is difficult to provide under bursty workloads. Zhang and Knightly use non-work-conserving schedulers that hold packets in the network, and simulate the original traffic stream. In effect, this approach substantially protects delay bounds from bursty traffic [Zha95,ZK96].