The performance gap between disks and other components of a computer system is large, and has been steadily widening over the years. Disk speeds are growing by only about 7% a year, and are not keeping pace with the 55% annual increase in CPU speeds. A substantial fraction of modern day systems are disk intensive, such as database systems, fileservers, large working-set web and ftp servers, and various types of scientific computation. New application types like multimedia clients and servers are driving the need for differentiated quality of service in the disk subsystem. It is thus becoming increasingly important for modern operating systems to provide high performance and accurate quality of service to applications.
The disk subsystem in an operating system can be viewed as a channel for disk requests from applications to the disk controller. A disk scheduler is a vital component of the disk subsystem, whose goal is to reorder these requests. This chapter presents an overview of disk schedulers as implemented in most operating systems.
Consider a single physical disk attached to a computer system. Applications invoke system calls like read and write to gain access to disk blocks. These are sometimes serviced from the filesystem cache that the kernel maintains; misses in this cache become disk requests. Applications may similarly map files into memory, and somehow access this memory region. This causes page faults that may sometimes also result in disk I/O. Most operating systems perform background maintenance tasks such as swap-space management and disk defragmentation. These could lead to disk I/O that is not directly initiated by applications.
These requests are serviced by the disk one after another, in some sequence. This sequence is determined by the disk scheduling policy, and thus, by the disk scheduler implementation. The disk scheduler is an artifact of the operating system that receives multiple requests, queues them internally in some fashion, and dispatches them one by one to the disk driver for (usually) immediate service.
Disk scheduling is fundamentally different in many ways from the related problem of CPU scheduling. Disk scheduling is explicitly request-driven. These requests are associated with two metrics: a request size, which is known in advance, and a request service time calculated in finish. These respectively result in performance measures like throughput in MB/s and disk utilization in fraction of disk time spent servicing requests1. In contrast, CPU quanta are only associated with the time spent running on the processor. Disk scheduling is a non-preemptive discipline, i.e. a request, once dispatched to the disk for service, cannot be withdrawn. In CPU scheduling, the processor can be relinquished early or the quantum forcibly preempted. Context switching overheads for disk take the form of seek and rotational latencies, and may range from very small to very large, depending on request placement. Context switching in CPU is moderately expensive, and does not possess as much variation. Such differences lead to problems unique to disk scheduling, such as deceptive idleness.
A disk scheduler is structurally composed of two parts: the scheduling framework, and an implementation of the scheduling policy. The framework provides the required interface between the scheduling policy and the rest of the kernel. The policy forms the actual implementation of some kind of request queueing, and thus provides a set of scheduler-specific methods invoked by the framework. The idea is to keep the framework constant between individual schedulers, while changing just these methods.
A scheduler is said to be work-conserving if it never lets the disk idle whenever there are requests waiting to receive service. A non-work-conserving scheduler may allow the disk to become idle, despite the presence of pending requests. The next section examines the structure and functioning of a traditional work-conserving disk scheduling framework.
The scheduling framework provides the infrastructure that receives requests from upper layers of the kernel, dispatches requests to the disk driver, handles completion of request service, etc. We slightly overstate its importance, since this is the part of the scheduler that we shall later be most concerned about.
The scheduler operates between two states: IDLE and BUSY, thus reflecting the running state of the disk (figure 2.1). The disk is initially idle. Requests arrive at the scheduler at any time; these are enqueued into the request pool using the scheduler-specific sched_enqueue(req) method. If the disk was idle at this moment, the work-conserving scheduling framework is compelled to issue a request to the disk immediately, thus calling schedule. This moment in time is a decision point. The scheduler chooses a request using the sched_choose() method, removes the request from the pool using sched_dequeue(req), switches state to BUSY, and dispatches the request to the disk driver for immediate service2. When a request finishes receiving service, the scheduler state is switched back to IDLE and the sched_finish(req) method is invoked if the scheduling policy wishes to be notified of this event. Then schedule is called again if any more requests are found pending (figure 2.2).
enum STATE { IDLE, BUSY } state = IDLE;
integer pending = 0;
issue(new): // invoked by upper layers of the kernel
sched_enqueue(new);
pending++;
if (state != BUSY) schedule();
schedule(): // local function
assert(pending > 0);
assert(state == IDLE);
next = sched_choose();
sched_dequeue(next);
pending--;
state = BUSY;
// send to disk driver for immediate service
PERFORM_DISK_IO(next);
finish(req): // called from the disk interrupt handler
assert(state == BUSY);
state = IDLE;
sched_finish(req); // if the scheduler wants to know
if (pending) schedule();
1.5
The simplest disk scheduling policy is First Come First Served (FCFS), wherein sched_enqueue and sched_dequeue just maintain a FIFO queue. However, this policy is efficient only for very light workloads, with little concurrency between requests. Magnetic disks with movable heads traditionally incur large overheads for positioning the disk head over the desired sector. In order to amortize this cost over larger data transfers, disk schedulers implement different kinds of seek reduction policies.
To give an example, a 7200 rpm IBM Deskstar 34GXP disk sustains a continuous 21 MB/s of sequential data transfer, but only about 5 MB/s of random-access 64k block transfer. This is because a 64k chunk takes 3ms to read from this disk, as compared to an average seek time of about 9ms.
The underlying idea in all seek-reducing schedulers is thus to capitalize on spatial locality of disk access. If a request issuer generates requests that are targetted at nearby disk blocks, the scheduler should be able to service multiple requests with very little head seek and platter rotation. Schedulers take a variety of approaches to implement this, often without starving any requests.
The Shortest Positioning-Time First (SPTF) policy greedily schedules the available request with the smallest head repositioning time. To eliminate the possibility of starving distant requests, Aged-SPTF (ASPTF) gives increasing priority to requests that have been kept pending for long. A scheduling policy commonly implemented in UNIX-based systems is the Elevator algorithm (C-LOOK), wherein the disk head traverses in only one direction, servicing all available requests on its way. On reaching the last request, it quickly moves to the first request and starts over [WGP94,JW91,SCO90]. Chapter 8 discusses these in greater detail.
With the evolution of the Internet, there grew an increasing demand for differentiated quality of service support in operating systems. This broadly translates to suitable management of all system resources, including disk service. Disk schedulers are thus becoming required to provide more functionality than just raw performance.
The central theme in differentiated QoS is to provide support for multiple service classes, and allow applications associated with each service class to be treated according to some service agreement. Traditional operating systems multiplex system resources among resource principals such as processes and threads. Banga et al. [BDM99] proposed the resource container abstraction that separates the notion of a resource principal from a process or a thread, thus providing the potential for fine-grained resource management. We use resource containers as our service class abstraction.
Among various types of differentiated service, this thesis mainly examines the impact of deceptive idleness on proportional-share scheduling. The service agreement in proportional-share disk schedulers assumes the form of a pre-assigned set of proportions; these are expected to be matched by the amount of disk service delivered to active classes. Several proportional-share scheduling algorithms have been proposed; some of these are described in chapter 8.
This thesis uses the Stride scheduler [Wal95,WW95], trivially adapted for disk scheduling. This algorithm performs deterministic proportional-time scheduling, i.e. it proportionally distributes disk service time (and thus disk utilization) among resource containers on a fine granularity. It maintains a separate virtual clock per resource container, and increments it on completion of a request attached to that container. The increment is proportional to the request service time, and is inversely related to the share assigned to the container. At decision points, a request is selected from the container with the smallest virtual clock. A container remaining idle for an extended period of time explicitly leaves; upon returning, it is assigned a virtual clock equal to a global clock that corresponds to the smallest virtual clock of all active containers.
A direct adaptation of Stride to disk scheduling would generally incur significant performance penalty. Some of our experiments therefore implement a relaxed variant of the above, which simultaneously performs seek reduction. At decision points, the scheduler picks any request within a threshold of the minimum virtual clock of all containers, which has the minimum positioning time from the current head position. This is somewhat similar to the PIso disk scheduler implemented in [VGR98].
The disk scheduler always reorders the available requests according to the scheduling policy. The aggregate disk queue may be long and bursty on loaded systems [SCO90], but there may be no request at the decision point that enables the scheduling policy to make a good decision. Disk schedulers are generally work-conserving, and proceed with whatever requests are available. Naturally, application-level objectives of performance and quality of service are not met.
The rest of this thesis characterizes deceptive idleness and its impact on various disk schedulers, and finally provides a systems solution to address its various manifestations.