We have examined a variety of situations where a naive implementation of some disk scheduling policies can fail to meet application expectations. We have seen how deceptive idleness affects real applications in practice, and causes significant losses of performance and QoS accuracy. We therefore need a simple, application-transparent, low-overhead solution to deceptive idleness that enables any scheduler to function as intended. This solution should be suitable for implementation in a general-purpose operating system, and should smoothly and efficiently complement workarounds like prefetching, initiated either from applications or from the filesystem. Lastly, in the more common case of not encountering deceptive idleness, the solution should have little or no impact on system performance. This section describes such a solution.
Deceptive idleness manifests due to a combination of three factors: (a) disk-intensive applications concurrently issuing synchronous streams of disk requests, (b) the intrinsic non-preemptible nature of disk requests, and (c) a work-conserving scheduler that doesn't wait for the subsequent request to arrive. Our solution takes the intuitive approach of changing (c) to wrap the scheduling policy in a non-work-conserving framework. At decision points, the scheduler now potentially waits for a short period of time for additional requests. Applications that quickly generate their subsequent requests can do so before the scheduler takes its decision; deceptive idleness is thus avoided. The disk is kept idle during this waiting period, as is the characteristic of all non-work-conserving schedulers. However, this wastage is not necessarily detrimental. The remainder of this section explains how a careful development of this method actually improves throughput consistently, and adheres to quality of service goals better.
The question remains as to whether and how long to wait at a given decision point. A naive approach may make the simplification of waiting for a fixed duration at every decision point, irrespective of the scheduling policy. This obviously gives us poor performance, so we wait selectively. In particular, the scheduler waits for the duration over which it expects the benefits of waiting to maximally outweigh the waiting cost.
However, precise interpretation of costs and benefits depends on the scheduling policy. A seek reducing scheduler may wish to wait for contiguous or proximal requests, whereas a proportional-share scheduler may prefer weighted fairness as its primary criterion. To allow for such flexibility, we modularize our solution into a scheduler-independent core NWCS waiting mechanism, and separate, scheduler-specific NWCS decision heuristics. The waiting mechanism forms a framework around both the scheduling policy and the heuristic. It implements the generic logic involved in waiting (with timeouts etc.), and invokes the two methods exported by the heuristics: evaluate, which makes informed decisions about whether and how long to wait at a decision point, and collect_stats, which accumulates statistics required for making those decisions. These heuristics are implemented separately for each scheduling policy, and have access to the internal data structures of the policy implementation (figure 6.1).
The remainder of this section details the core NWCS mechanism, then spells out two assumptions, and then describes separate NWCS heuristics for seek reducing and proportional-share schedulers. Finally, it covers some implementation issues.
A traditional work-conserving scheduler operates between two states, IDLE and BUSY, with transitions on scheduling and completion of a request. Applications can issue requests at any time; these are enqueued into the scheduler's pool of requests. If the disk is idle at this moment, or whenever another request completes, a request is chosen from this pool and scheduled. This calls the scheduling policy function select, dequeues the chosen request from the scheduler pool, and dispatches it to the disk driver.
Our proposed NWCS mechanism forms a wrapper around a traditional scheduler. It invokes the scheduling policy to select a candidate request. However, instead of dequeueing it immediately, it first evaluates this request using the heuristic, which uses some nontrivial decision process to return either zero or a positive integer. Zero indicates that the heuristic has deemed it pointless to wait; we therefore proceed with the candidate request. However, a positive integer represents the waiting duration in microseconds that the heuristic has judged suitable. The scheduler initiates an event-notification based timeout for that period, and switches to a new WAIT state. Though the disk is inactive while in this state, it differs from IDLE by having pending requests and a ticking timeout.
Applications may issue new requests in this waiting period. In fact, a good heuristic would imply high likelihood that a new request will indeed arrive, and be preferable. This incoming request is not immediately serviced, nor is it even directly considered for evaluation. It is simply enqueued into the scheduler pool, and the policy selection routine is invoked to choose the best candidate request at this point. This approach retains the intrinsic properties of the policy like whether or not it has potential starvation, prevents duplication of scheduling policy code, and achieves an elegant separation of concerns between the policy and the heuristic. This candidate request is evaluated as before, and dispatched immediately (and the timeout is cancelled) if the heuristic decides thus. However, the waiting case is different: if the scheduler is in WAIT state and the heuristic decides to wait for an additional specified duration, then we disregard this duration and refuse to retrigger the timeout. This prevents unbounded waiting by repeatedly retriggering the timeout, since the suggested duration by the heuristic never exceeds a certain limit (15ms). If the timeout expires before any suitable incoming requests arrive, then we dequeue and dispatch the current candidate request. This algorithm is depicted in the state diagram in figure 6.2.
We cause up to milliseconds of delay between consecutive requests. The disk platter continuously spins under the head, so the target sector on the next request may be missed, and may require a complete 8ms rotation to return to. Thankfully almost all modern disks have a track buffer that prefetches and caches sectors as the disk spins by. This does not solve deceptive idleness, but it allows for an efficient NWCS solution.
We tentatively make two simplifying assumptions about the granularity at which applications issue requests. We may observe a slight drop in performance if either assumption is violated.
On completion of a request, we try to wait for the causally dependent request. This request could potentially originate from any process, leaving us very little information about how long to wait. However, for most applications, dependence between requests gets reflected in code structure. The subsequent request is typically generated by the same process, either in a loop or based on the outcome of the last request. It is rare that a group of processes explicitly coordinate to issue a synchronous stream of requests (an example of this is the FreeBSD implementation of aio_read using multiple kernel processes). This common-case application structure makes it sufficient to only wait for the process that issued the previous request. Therefore, assumption #1: Synchronous streams of disk requests are issued at the abstraction level of individual processes. These processes can be running in either user or kernel mode. If the system supports kernel threads, and if applications use them, then we should replace processes by kernel threads.
This assumption lets us add a clause to the NWCS mechanism that diminishes the impact of sudden variations in application behaviour, and corresponding heuristic errors. Consider a process that generates a stream of rapid, good requests, and suddenly issues one bad request (due to filesystem layout, for example). The heuristic expects a good followup request to arrive soon, and decides that the policy-chosen candidate is bad enough to warrant further waiting. But the process has blocked on I/O, and will not issue any further requests. Assumption #1 allows us to disregard the heuristic decision, cancel the timeout and force an immediate dispatch of the chosen candidate.
The previous assumption seems sufficient to accumulate sensible per-process statistics to assist in request evaluation. However, a process could be issuing two types of synchronous read requests, like accessing one file sequentially and another randomly. This would require statistics to be collected at the level of granularity of file descriptors, for example. We tentatively assume that this doesn't happen, thereby stating assumption #2: Barring occasional deviations, there is reasonable homogeneity in the properties of synchronous requests issued by a process. Suggestions to relax both these assumptions are provided in section 6.8.
This section describes the scheduler-specific NWCS heuristics for seek reducing schedulers such as SPTF, ASPTF and C-LOOK. The Shortest Positioning Time First [SCO90,JW91,WGP94] policy calculates the positioning time for each available request from the current head position, and chooses the one with the minimum. The goal is to design a heuristic that makes its waiting decisions to maximize expected throughput.
This heuristic evaluates a candidate request as chosen by the policy. The intuition is as follows: if this candidate request is close to the current head position, then there is little point in waiting for additional requests. Otherwise, using assumption #1, if the process that issued the last request is likely to issue the next request soon (i.e. its expected median thinktime is small), and if that request is expected to be close to the current head position, then we decide to wait for it. This waiting duration is the expected 95-percentile thinktime, within which there is a 95% probability that the request will arrive.
This idea can be generalized into a succinct cost-benefit equation. The key
point is to profitably balance the benefit of waiting, i.e. expected gains in
positioning time, against the cost of waiting, which is the additional time
likely to be wasted. If LP is the last request issuing process, and
is the time since completion of the previous request, then
Positioning time for the candidate request is calculated using a suitable estimator (more on this in section 6.7). The three expected times are gleaned from online statistics collected by monitoring newly arriving read requests. Specifically, the expected positioning time for each process is a weighted average over time of the positioning time for requests from that process, as measured upon request completion. The decay factor is set to forget 95% of the old positioning time value after ten requests, so it adapts fast. We could alternatively track the expected seek distance of a request from the previous request issued by that process, and calculate expected positioning time on the fly.
Expected median and 95%ile thinktimes are estimated by maintaining a
decayed frequency table of request thinktimes for each process. Thinktimes
are computed from the point of completion of the last request for this
process, to the current time. If however, the scheduler already has a read
request queued for this same process, then we treat this new request as
asynchronous and set its thinktime to zero. We maintain 30 per-process
buckets that store the count of requests that arrive after various
thinktimes, ranging from 0 to 15ms at a granularity of 500
s per bucket.
We decay these bucket counts by reducing all of them to 90% of their
original value for every incoming request for that process. The thinktime
distribution usually looks like a bell curve. Assumption #2 guarantees that
this curve has at most one hump; for many applications, it is located at
about 1ms. Lastly, we calculate the median and 95%ile points of this curve.
We do all the above for every incoming synchronous request.
For a policy as conceptually simple as SPTF, this is the perfect greedy heuristic: it always tries to make waiting judgments with throughput as the only goal.
SPTF suffers from poor starvation resistance, potentially allowing distant requests to starve. Wrapping it in the NWCS solution only increases this likelihood. Therefore a weighted variant of SPTF has been proposed, which explicitly bounds response time. Aged-SPTF [SCO90,JW91,WGP94] remembers the queued time for individual requests, and continuously raises the priority of old requests. A request with sufficiently high priority overrules the SPTF decision.
Firstly, the SPTF heuristic would work unmodified. When ASPTF chooses a distant request that is too old, the SPTF heuristic would be unaware of this. It may decide to wait for additional, nearby requests. However, even if any new request from the last process arrives in this period, the policy will continue to pick the same old request. The last process will get blocked, and we would proceed with the policy-chosen candidate. This has worked exactly as desired, except for one unnecessarily wasted thinktime on each of the infrequent occasions that this happens. This minor suboptimality can be fixed by making the heuristic aware of the policy internals. Whenever a request is chosen that doesn't coincide with the corresponding SPTF selection, then we decide not to wait.
One extremely popular scheduling policy is the Elevator algorithm or C-LOOK, implemented in most Unix-based operating systems. This policy makes the head move in a unidirectional manner, servicing all requests in one direction, and then starting all over at the beginning. Though this policy only makes moderate sense from the point of view of either throughput or response time, we attempt an NWCS heuristic.
We base this on SPTF, with one additional clause. The statistics collection module in the heuristic additionally maintains a decayed expectation of the seek direction: forward or backward. On evaluating a request, if the current candidate involves a forward seek and the expected next request has a fairly high likelihood (more than 80%) of a backward seek, then we bypass the cost-benefit equation and decide not to wait. In the opposite case, we wait for the usual amount of time. For applications performing random access, with roughly 50% of the seeks pointing in each direction, this heuristic for C-LOOK is not ideal. This is because C-LOOK itself is poorly suited for handling this case.
We examine NWCS heuristics designed for proportional-share schedulers like Stride [WW95] and Yet-another Fair Queueing (YFQ) [BBG$^+$99]. Stride maintains weighted virtual clocks to remember the amount of disk service received by each process. A request is chosen from the resource container with the smallest virtual clock, so as to advance them in tandem.
Unfortunately, deceptive idleness forces these virtual clocks to go out of sync. Some resource containers do not generate enough requests in time, and their virtual clock lags behind. Genuinely idle containers also lag behind, but their expected thinktimes are high. Our heuristic for Stride is therefore as simple as waiting for requests from the process LP that issued the last request, if three conditions are met:
The Stride scheduler explicitly requires active processes to join the candidate pool, and those inactive for extended periods to leave. A joining process is assigned a virtual clock value equal to the global virtual clock at that time, which is effectively the minimum of virtual clocks of all active processes, if any. This ``bumping up'' of virtual clocks prevents an inactive process from accumulating virtual time and monopolizing the resource upon later activation.
In contrast, schedulers like Yet-another Fair Queueing (YFQ) and Start-time Fair Queueing (SFQ) [GGV96] deduce the notion of process idleness, by checking whether a queue is empty at the time a new request arrives. Deceptive idleness causes this condition to be insufficient. A heuristic for YFQ or SFQ thus needs to make waiting decisions, as described above for Stride. In addition, it needs to explicitly convey this information to the scheduling policy, which should use this as the idleness criterion for bumping up virtual clocks.
Consider a relaxed proportional-share scheduler that picks requests from
containers with virtual clock between
and
(where
could be 100ms). Among these, it chooses the request with
minimum positioning time [VGR98]. Such schedulers demand
general methods of combining NWCS heuristics for two separate policies, e.g.
proportional-share and SPTF.
A naive approach is to separately evaluate the candidate request on each of the two heuristics, and return the larger of the two return values. In other words, if the waiting decision is taken for either reason, then the combined heuristic will conservatively choose to wait.
There are two minor performance problems with this approach, based on the
nontriviality of combination of the two policies. If say the SPTF heuristic
decides to wait for a favourable request, but if it is known beforehand that
the combined policy will definitely not choose this request, then there is
little point in waiting. This can be solved by using more inside information
from the policy. Secondly, the proportional-share scheduler has relaxed due
to the introduction of
; the naive heuristic has an overly stringent
lagging-behind condition. The heuristic should check for potentially
lagging behind, which happens when the container bound to the process that
issued the last request has a virtual clock smaller than
.
We have not implemented the heuristic for any real-time scheduler, but we make an observation regarding the pure Earliest Deadline First (EDF) policy. At decision points, this policy chooses the available request with the smallest remaining time to deadline. It does not take seek effects into account, and is fused in practice with some seek optimizing scheduler.
Compare pure EDF with SPTF: at decision points, the latter calculates positioning times from the current head position to each available request, and chooses the smallest. The heuristic for SPTF can thus be morphed to one for EDF by substituting positioning time by deadline. In SPTF, one hopes that the last request issuer produces a new request with small positioning time. Correspondingly in EDF, the last issuing process may spend some thinktime and then issue a request with a small deadline. If the sum of the leftover thinktime and the deadline of the new request is smaller than the deadline of the candidate request, then pure EDF justifies us in opting to wait.
A more practical real-time scheduler would include other factors like positioning time for seek reduction, and total request service time to see whether the current candidate request can be squeezed into the expected deadline of the forthcoming request.
We implemented the NWCS framework and heuristics in the FreeBSD-4.0 kernel. The code comprises of a kernel module of about 1500 lines of C code, and a small patch to the kernel for necessary hooks into the scheduler and disk driver. Our experiments are run on a single 550MHz Pentium-III system, equipped with a 7200rpm IBM Deskstar 34GXP IDE disk and 128 MB of main memory. There are two practical difficulties: calculating positioning time for requests, and building an inexpensive event-based timeout mechanism.
Estimating access time for requests is nontrivial due to factors like rotational latency, track and cylinder skews, and features of modern disks like block remapping and recalibration. Nonetheless, much work has been done in this context [JW91,HC00,RW94], and it is possible to build a software predictor with over 90% accuracy. However, we used a much simpler block number based approximation to positioning time. We wrote a user-level program that performs some measurements (taking about 3 minutes) and fits a smooth curve through these points. This method automatically accounts for seek time, average rotational latency and track buffers. This has an accuracy of only about 75%, but the NWCS heuristics are generally insensitive to this error.
There are many possible timer mechanisms to choose from. We use the i8254
Programmable Interval Timer (PIT) to generate interrupts every 500
s,
and build a simple timeout system over that. Experiments demonstrate how
this rather inaccurate timer is amply sufficient for our purposes. Each
interrupt causes a processing overhead of about 4
s on our
hardware [AD99], thus causing about 1% CPU overhead on
computational workloads. Other timeout mechanisms can be used in place of
the i8254, if higher accuracy is desired. Some pentium-class processors
(mostly SMPs) have an on-chip APIC that delivers fine-grained interrupts
with an overhead of only 1 to 2
s per interrupt. Alternatively,
soft-timers [AD99] pose an extremely light-weight
alternative.
The NWCS algorithm, as presented above, applies to IDE disks that have no additional queueing within the controller. It requires only one modification to work for SCSI disks with tagged queueing: the BUSY state variable needs to be extended to an integer that holds the number of outstanding requests. The algorithm automatically and correctly handles some subtle issues raised in this context, as follows.
Even when the disk is busy, schedule() gets invoked multiple (usually up to 4) times. The controller expects the scheduler to supply as many requests if available, and performs seek reduction (usually SPTF) among those internally queued requests. A scheduler that efficiently eliminates deceptive idleness should therefore do the same under normal circumstances, but effectively disable tagged queueing whenever deceptive idleness sets in. This prevents the disk controller from scheduling a request before the subsequent request from the last process has arrived.
The above heuristic (with the BUSY modification) does exactly this, because whenever the heuristic decides to wait, the scheduler pretends (to the controller) as though there are no more requests available. Thus if a process does sequential disk access with asynchronous prefetch, both these requests will get scheduled to the controller, and no more (e.g. requests from other processes). This exploits tagged queueing just as much as is performance critical. If a process performs sequential disk access without asynchronous prefetch, then requests will get scheduled one after another - like IDE controllers. For processes with huge think times, or for random disk access, the scheduler never decides to wait. So tagged queueing gets exploited to the fullest.
There is a subtle problem with above reasoning, but it automatically gets resolved. The scheduler knows the set of requests that have been dispatched to the controller, and that one of them is currently receiving service. It doesn't know which of those requests will be the last one to execute. It needs to know this for two reasons.
Our proposed heuristics collect statistics about application behaviour, assume them to be valid, and use them to make decisions. We suggest two approaches to improving this NWCS solution, in both of which the heuristic has a smaller chance of getting misled. These are aside from the obvious ones of making the timeouts and the positioning time estimator more accurate.
A few incorrect decisions are usually acceptable, but sustained errors are not. We can therefore take a feedback-based approach to supplement or validate our heuristic knowledge with additional measurements:
(1) In addition to tracking expected thinktimes and positioning times, we could collect information about the variance of these estimates. This gives the heuristic an idea of how accurate these estimates really are. (2) We could keep track of how frequently timeouts expire for each process. If these exceed some threshold rate, then regardless of all other notions of accuracy, we know that something is wrong. (3) We may not be able to predict positioning times accurately, but we can measure it after the request has completed service. This tells us something about the error in the estimator, and thus, our confidence in the correctness of a future decision. (4) Some applications use aio_read to issue requests synchronously; we can determine this post-facto, and remember the fact for future decisions.
Some proportional-share schedulers work at a higher level of abstraction than processes, like resource containers [BDM99] and reservation domains [BGÖS98]. Sometimes a group of processes may collectively issue requests. Applications often simultaneously generate different access patterns on different file descriptors. Some programs may issue two kinds of disk requests from two different parts of the program code, but on the same file descriptor. Seek reduction intrinsically deals with requests in the same region on the disk.
The general solution to all this is to collect statistics at each level of abstraction, i.e. processes, threads, instruction pointer for thread, file descriptors, and disk region - along with their variances. The heuristic then chooses the highest consistent level out of these, which has low variance and is expected to be correct. More tricks can be played to reduce computational overhead of such a system.