next up previous contents
Next: 6 Non-work-conserving scheduler solution Up: The Effect of Deceptive Previous: 4 Effects of deceptive   Contents

Subsections


5 Workarounds for deceptive idleness

We have seen how a variety of fairly commonplace scenarios can easily develop deceptive idleness, and how this can result in most disk schedulers behaving in an undesirable manner. On the other hand, we ideally want to be able to plug any disk scheduler into an operating system and expect it to consistently function as intended. Thus we feel the need for solutions that address deceptive idleness.

We immediately recognize a possible workaround, based on asynchronous prefetch. If every active request issuer somehow proactively maintains even a single outstanding request of the right kind at decision points, then deceptive idleness can be prevented completely. The scheduler will have the option of choosing requests from any of the candidate issuers, including the one that issued the last request (this process is no longer mistaken to be idle). Preventing deceptive idleness thus requires either the application or the kernel to issue one or more supplementary requests before the current request has finished; these generally assume the form of various forms of asynchronous prefetch.

1 Application-driven prefetch

Applications can embrace programming paradigms and techniques that prevent the onset of deceptive idleness. We discuss several approaches, in each of which the application takes care to issue multiple outstanding requests of the right kind. By kind, we mean contiguous or proximal requests for seek reducing schedulers, without which there is little point in avoiding deceptive idleness.

1 Multi-process architecture

Perhaps the most intuitive approach is that of designing the application with multiple processes or kernel threads. If the application collectively creates enough disk activity, then we would expect several of these processes or threads to produce outstanding requests. Indeed, webservers like Apache and Flash are based on such a programming model. If we desire proportional disk service allocation between two apache servers, then the scheduler always has enough requests to take sensible action.

Naturally, it would be difficult to expect all applications to be amenable to a multi-process or multi-threaded model. For example, a single process architecture with a sequential programming style may be perfectly suited to a movie player application.

There is also a performance problem with the above: requests issued by multiple processes often correspond to independent activities, and possess little spatial locality. For a seek reducing scheduler, these requests are not of the right kind. Though deceptive idleness is solved at the level of the entire application, it is still present in individual processes, which in this case matter for seek reduction. This results in a naive multi-process application (like apache) achieving low disk throughput due to seeks between requests from its processes. An exception to this is when processes explicitly coordinate to issue a synchronous stream of requests, like the implementation of asynchronous I/O in FreeBSD.


2 Asynchronous I/O

If an application is aware of its future request issue pattern, it could proactively issue asynchronous prefetch requests. Alternatively, it can roll its own asynchronous I/O implementation using multiple processes or kernel threads. This method could be powerful and accurate, since the application is generally in the best position to estimate its future request pattern. Since the application controls the request being issued, we do not have the performance problem usually incurred by a multi-process architecture.

This section demonstrates how the aio_read() system call can be effectively used to construct two parallel streams of requests, thus providing multiple outstanding requests at every instant. The semantics of aio_read() are to issue a disk request and return immediately. A pre-specified signal is delivered to the process when the request finishes. Thus, issuing another aio_read() at the end of this signal handler will result in a single chain of requests. However, here we need two streams of independent disk requests to prevent deceptive idleness. We therefore set up two disjoint copies of this request chaining mechanism, each with its own signal handler (SIGUSR1 and SIGUSR2). These handlers need mutual exclusion between themselves, in order to select the next request consistently. This is easily achieved by masking each signal from the other while they are being handled, using sigaction() to set the signal mask.

There are several drawbacks to using asynchronous I/O. Firstly, asynchronous prefetch may not be possible even with complete application support, when the forthcoming request is intrinsically dependent on the currently executing one. For example, filesystem metadata or database index traversals, future requests for webservers, etc. Such a programming model is quite cumbersome to adopt, and obviously needs most applications to be rewritten for this purpose. Moreover, the implementation of asynchronous I/O in FreeBSD is still quite unstable (aio_read is part of an optional POSIX realtime extension), and is not enabled in the default kernel configuration. Issuing explicit read requests using Unix API functions (instead of memory mapping the file) may entail more data copying and cache polluting, which could become expensive for in-memory workloads [PDZ99].

Thus, though asynchronous I/O provides us a powerful application-based workaround to address deceptive idleness, it may not be sufficient or desirable.

3 An mmap/mincore/read interface

The I/O subsystem in FreeBSD issues asynchronous prefetch requests for sufficiently large and sequentially accessed files. However, the VM subsystem does not do so on page faults (more on this in section 5.2.4). It is possible to work around this problem. The Flash webserver does this, by issuing read() requests explicitly on a memory mapped region. However, the read system call has performance implications when reading cached data, due to data copying and multiple buffering overheads [PDZ99]. Flash thus uses the mincore() system call on a memory-mapped file, and issues read requests only if the data needs to be actually read from disk. The overhead of copying is somewhat small compared to disk access time. In short, applications that desire good performance on current systems are sometimes forced to implement system-specific hacks.

4 Metadata caching

Metadata (directories, inodes, indirect blocks) is usually scattered quite randomly on disk, so there are few seek reduction opportunities involved. So the issue becomes that of transient violation of proportions or transient missing of deadlines, if metadata gets automatically shared in undesirable ways. This especially happens at the start of application execution, when cache misses in memory are mostly compulsory misses. Many applications are insensitive to such transient effects.

But hard real-time applications, for example, cannot allow for missing even a single deadline. The application-level workaround here is to periodically access the files involved, thereby bringing all metadata information to cache. Future metadata accesses therefore never actually reach the disk.

In FreeBSD, the metadata caching mechanism maintains an implicit cache size threshold of a few thousand entries. Caching more entries causes eviction of old entries. This may be undesirable for some large repositories of files (e.g. the CS department webserver stores more files). FreeBSD provides an indirect mechanism to cache more directories, by enabling a sysctl named vfs.vmiodirenable. This makes the kernel view directories differently, and effectively increases the directory cache size (making it compete with the file cache).

Such situations are a rare occurrence, and such workarounds are clumsy. In general, it is often possible to eliminate deceptive idleness completely by taking proper precautions from the application. However, some of these are quite unnatural, and preferably avoided.

Secondly, modifying or rewriting applications may not even be allowed under many circumstances. A traditional general-purpose operating system may want to consistently provide seek-reduced best-effort service to existing applications. In some modern operating systems, resource management frameworks like resource containers [BDM99] and reservation domains [BGÖS98] allow for unmodified applications to be bound to the resource containers, and be allocated the designated resources.

In this world, it becomes essential for the operating system to be able to eliminate deceptive idleness, irrespective of what some potentially malicious application may try to do. A busy application that is expected to receive a fraction of the available disk resources should, under no pretext, be externally prevented from doing so. This idea forms the basis for analyzing transparent in-kernel solutions that solve deceptive idleness.

2 Kernel-driven prefetch

Filesystems can (and most do) try to guess future request patterns for applications, often at the granularity of file descriptors. They issue separate asynchronous prefetch requests3 for them. The usual reason is to overlap computation with I/O [SSS99], but this prefetching prevents deceptive idleness too. The primary purpose of this section would be to understand the limited scope in which this can be successfully performed in various situations. The underlying message is that asynchronous prefetch becomes infeasible to implement and function correctly in a number of circumstances, some of which are quite important for some disk scheduler types. We examine a progression of such reasons. In each of these, the penalties of misprediction are high, thus forcing the prefetch heuristic to be very conservative.

1 Async prefetch may be completely impossible

We have seen examples of how request dependencies can be intrinsic in the nature of these requests. Database index traversal, or similarly, metadata access involve dependent disk accesses that not even the application can predict in advance. So any form of prefetch, including those initiated by the application, becomes impossible to issue.

2 Async prefetch may be impossible from the OS

Some applications are aware of their disk access pattern in advance, but to the kernel, this could appear totally random. Intelligently issued asynchronous I/O from the application is capable of solving this problem, but transparently prefetching from the kernel is not possible. In this and the previous cases, it may be argued that the benefit gained out of seek reduction is usually quite small due to the random nature of disk access. However, there is a gray area between sequential and unoptimizably random requests, where the predictive power of most prefetching heuristics does not suffice. For example, consider two processes, each accessing every alternate 64 KB chunk in different large files (one file per process). Neither FreeBSD nor Linux possess prefetch detection algorithms sophisticated enough to issue suitable asynchronous prefetch requests for such nontrivial access patterns. Eliminating deceptive idleness could fetch a benefit of a factor of two in this case.

Also, proportional-share schedulers do not care about request placement: they are expected to work equally well for sequential as well as random disk access, and in the latter case, we clearly need solutions more feasible than asynchronous prefetch.


3 Async prefetch may need large files to detect

The I/O subsystem of FreeBSD issues asynchronous prefetch requests for the read system call, whenever it detects sequential disk access. But this involves internal logic that maintains some state, and detects and reacts to several sequential disk accesses. Thus, accessing medium-sized files may be regarded as random, simply because the conservative prefetch has not yet kicked in. This can adversely affect the performance of applications like the apache webserver that generally access medium-sized files.


4 Async prefetch may not be implemented: practical difficulties

Despite the possibility of performing prefetch, some subsystems of some operating systems are constrained by fundamental design choices and limitations that prevent them from doing so.

Perhaps the most conspicuous example of this is the VM subsystem in FreeBSD (as of 5.0) not issuing any asynchronous prefetch requests for disk accesses initiated by page faults. A read() request supplies a request size along with the block number, whereas a page fault provides less information. Consequently, the logic that detects and issues prefetch becomes more complicated, and is just not implemented. To compare, Linux has started supporting prefetch pagefaults only recently. This has serious consequences: webservers map small files into memory and directly write them out to a socket. The kernel incurs multiple page faults on this data, and immediately encounters deceptive idleness.

The sendfile() system call in FreeBSD issues explicit back-to-back disk requests, without asynchronous prefetch. In this case, prefetch should be relatively straightforward to implement, and is just not done so yet.

Large directories in FreeBSD, despite having approximately the same on-disk structure as regular files, do not generate asynchronous prefetch on being accessed. This is because readdir() uses low-level interfaces that do not trigger prefetch mechanisms. This is an example of how OS design choices hinder the issue of asynchronous prefetch.

1 Common examples of this problem

One manifestation of deceptive idleness, viz. low disk utilization in seek reducing schedulers due to page faults not issuing asynchronous prefetch is a somewhat well-known phenomenon in the application development community. This is primarily regarded as a specific failure of some operating systems to provide asynchronous prefetch for page faults; fixing them would solve these concerns.

To avoid a possible denial-of-service attack on IRIX systems, the apache webserver serves files smaller than 4 MB by mapping them into memory and writing them out onto the socket. Beyond 4 MB, it switches strategies to reading them directly into a local buffer using fread(), and writing these chunks out. Ftpd-6.00LS in FreeBSD-5.0 does something similar, except that it hardcodes the switching point (again arbitrarily) to 16 MB.

The tradeoffs are hairy in either direction. Using read() is clearly beneficial if the kernel does not implement asynchronous prefetch for servicing page faults from disk. On the other hand, the mmap approach is strongly preferable for files served from cache, because of many reasons: it reduces data copying (and thus cache polluting) costs, and it doesn't occupy space in the application's footprint for receiving and perhaps caching the data [PDZ99]. These costs are somewhat tolerable in comparison to disk I/O.

Figure 5.1 illustrates an experiment that studies the performance improvement of CSCAN with and without asynchronous prefetch. The apache webserver is used unmodified except for changing the MMAP_LIMIT threshold, to get apache_read and apache_mmap. The workload consists of a repository of many equally sized files, which are requested back-to-back by two clients. This experiment exercises only the disk subsystem, so copy avoidance and other benefits of mmap are not shown.

Figure 5.1: Impact of asynchronous prefetch on seek-reducing scheduler performance
\begin{figure}%%\centerline{\psfig{figure=../x/old/cscan/ara.eps,angle=270,width...
...ig{figure=../../x/old/cscan/ara.eps,angle=270,width=0.6\linewidth}} \end{figure}

For small files, we observe apache_read behaving almost as badly as apache_mmap. As the filesize increases, slight amortization of other overheads causes apache_mmap to improve marginally. In contrast, the performance of apache_read shoots up to 75% of the maximum possible disk bandwidth. This is because the latter capitalizes on seek reduction opportunities by issuing asynchronous prefetch requests. Apache, in its default configuration, would pay the performance penalty of mmap until filesize reaches 4MB, and follow the read curve thereafter.

Thus, the kernel implementing asynchronous prefetch on page faults would completely eliminate the need for this messy tradeoff with seek-reducing schedulers, when requests are actually sequential. Of course, in this situation, applications streaming large files may perhaps want the kernel to implement some scheduler other than CSCAN (like ASPTF or GSPTF), to avoid incurring huge maximum response times.

To bring out the deeper nature of deceptive idleness, we later evaluate our solution using some experiments that are not limited by the above weaknesses in FreeBSD's asynchronous prefetch implementation.

5 Summary of this section

The key point of this section is that asynchronous readaheads implemented and always correctly functioning in every I/O processing mechanism in the kernel is sufficient for solving deceptive idleness. However, prefetch mechanisms demand an accurate prediction of future access, have huge misprediction penalties, and are either impossible or difficult to implement for a variety of reasons. We therefore propose an alternative solution based to deceptive idleness, based on a non-work-conserving scheduling framework.



Footnotes

... requests3
different from synchronous readahead, where requests are enlarged to 64 KB to amortize seek costs over larger reads.

next up previous contents
Next: 6 Non-work-conserving scheduler solution Up: The Effect of Deceptive Previous: 4 Effects of deceptive   Contents
Sitaram Iyer ssiyer@cs.rice.edu