The immediate question, therefore, is of why a disk scheduler might possibly develop an incorrect understanding of request issuer idleness. This section presents a categorized picture of these sources of deceptive idleness, without worrying about any of their consequences. The underlying message here is that many of these are fairly commonplace situations brought about by operating system design choices and standard programming practices.
We identify two fundamentally different reasons for deceptive idleness: request dependencies, and contention for some other resource. In practice, each of these could manifest in many different ways, potentially originating at the application level or from within the kernel.
Imagine a request issuer generating a single, continuous stream of back-to-back disk requests. A few hundred microseconds after one request completes and the issuer gets control, it issues the next one. So from the application's point of view, this request issuer is never idle.
Consider the same scenario from a scheduler's viewpoint. A work-conserving scheduler is compelled to issue the next request (if available) immediately after completion of the previous one. At this decision point, control has not yet been returned to the issuer of the previous request. This issuer is waiting for the outcome of the previous request before issuing the subsequent one, and is unable to do so. Hence there are no requests pending for this issuer, leading the scheduler to falsely assume upon its idleness. This is an example of how synchronous request generation can lead to a misunderstanding of request issuer idleness.
Not surprisingly, this effect manifests itself in practice in various different ways. If an application implements a single loop in which it invokes the read() system call on random disk blocks in a file, then we notice a chain of dependent requests at the application-kernel boundary. Upon completion of one request, control needs to be returned to userspace for the issuer to generate the next request.
In fact, applications running in user mode are not the only sources of dependent requests. Certain operations like write from file to network, and the sendfile system call have the I/O subsystem as the request issuer. These can potentially generate a stream of dependent read requests from within the kernel. A context switch to the application is not required for generating the subsequent request, yet we observe deceptive idleness due to the work-conserving scheduler acting immediately.
It may be argued that some of the above can be solved by suitably rewriting the application to use asynchronous I/O. However, this is not always true: sometimes request dependencies are unavoidably inherent in the very nature of these disk requests. This is the case with many common applications; for example, the apache webserver does not know of a future disk request before receiving the request. Some database index traversal operations may necessarily have to be performed one after another.
Such conditions may also naturally arise within the operating system. For instance, reading a file typically involves reading some associated metadata first. This could potentially involve several disk reads for inodes, indirect blocks and directory contents, which need to be performed one after another. As a pathological case, consider a fairly long chain of symbolic links that need to be traversed to access a particular file. Further, imagine that each of these symbolic link files is buried deep within a directory hierarchy with no common path components between them. To access the target file, many dependent disk accesses would be required.
Performing multiple simultaneous I/O operations (instead of issuing asynchronous I/O) may also prove insufficient for solving deceptive idleness. Imagine a multi-process webserver that reads many small files scattered over an intricate directory structure. Each process incurs metadata dependencies, but the server as a whole would seem to issue enough requests whenever required. However, this need not always be the case: if many of these accessed files have a significant number of their path components in common, then they implicitly share the metadata associated with those directories. As a result, reading even different files may entail reading the same metadata block, thus generating a single stream of dependent requests for the entire server.
Even if request dependencies are altogether avoided, it may be possible to encounter deceptive idleness for quite different reasons. For example, there may be factors within the kernel that prevent an application from making progress at certain crucial moments. Such internal blockage could be a direct consequence of the disk scheduler. Despite the application trying to reach the point where it can issue requests, the scheduler will observe no requests actually being issued, and assume the issuer to be idle.
A situation like this could arise in practice due to contention for the main memory resource. A process that allocates and uses enough memory will eventually be forced to pageout some of it to swap. When this happens, another process that needs memory before issuing a disk request gets blocked, waiting for the first to finish paging itself out. Thus from the application's point of view, there is no reason why it should not continuously issue disk requests. However, the scheduler momentarily sees no disk requests emanating from the second process, thereby considering it idle and usually takes incorrect action. This effect somewhat resembles priority inversion, where a high-priority process contends for a kernel resource that is currently held by a low-priority process.
We characterize all these sources of deceptive idleness, by performing a simple experiment. A Stride scheduler is assigned 1:2 proportions to two resource containers. These containers thus constitute the request issuers for this scheduler. One or two processes are bound to each container, as specified later. These processes continuously perform various actions as described, and never remain idle. The experiment involves measuring at every decision point, the number of request issuers (0, 1 or 2) that are busy according to the scheduler (i.e. have pending requests at those moments). These are averaged over periods of one second, so we could get nbusy values continuously between zero and two. Ideally we want this number to be always equal to two, so as to reflect the application's notion of non-idleness.
We bind two processes to each resource container, and arrange for each of these processes to continuously issue randomly positioned disk requests. So at every decision point, at least one of the two processes in each resource container always has a request pending. Thus the number of busy resource containers (nbusy) is practically always 2, in figure 3.1.
This experiment binds just one such process to each resource container. As soon as a request from either container finishes being serviced, the process bound to that container has not yet issued another request, and thus appears idle to the scheduler. We observe the effect of application-originating request dependencies, and nbusy is almost continuously near 1 (figure 3.2). The scheduler is thus never able to successively service multiple requests from any container.
We get an identical graph from request dependencies originating within the kernel. In this case, we map files into memory, and without actually accessing the contents, write them out onto a network socket. This is how many webservers (e.g. apache) normally operate. The kernel copies this memory region into network buffers, and incurs a stream of page faults. Handling these involves issuing a synchronous stream of 64 KB disk requests. The kernel could inform the scheduler that these requests are actually part of one continuous stream, but FreeBSD does not do that. Between requests, the scheduler incorrectly assumes that the issuer (I/O subsystem) has become momentarily idle.
The following experiment illustrates a slightly extreme case of sharing metadata between request issuers. We create a repository of small files, embedded in a four-level directory hierarchy. We allocate two resource containers, and bind two processes to each. These four processes read these small files in order, making sure they don't read a file twice. Thus, they are implicitly forced to read the same metadata most of the time, and these have inherent dependencies in them. We therefore encounter short bursts of dependent metadata requests followed by short bursts of independent file data requests. These average to about 1.5 busy resource containers at any time. In the first few seconds, some higher level directories also needs to be read, so nbusy drops even further in that period (figure 3.3).
Resource contention for memory could lead to deceptive idleness in some pathological circumstances. In this experiment, two processes bound to each of the two resource containers issue random disk requests. From the application's point of view, these requests keep each resource container always busy.
At the same time, each process allocates and consumes up to five times that much memory, and doesn't free them. This can be viewed, for example, as memory used to store various processed versions of the data read from disk. When available physical memory in the system gets exhausted (about halfway into the experiment), the pageout daemon gets activated and starts writing to swap space. We associate these write requests with the appropriate resource container, depending on who owns the memory that is being written out.
While choosing pages for eviction, the kernel does not distinguish between processes that own these pages; it simply runs a global clock algorithm. It may therefore happen that one process gets blocked on memory allocation, and can proceed only after a process from the other resource container has written large chunks of memory out to disk. This leads to deceptive idleness, since the scheduler assumes the first process to be idle in this period. Thus in figure 3.4, nbusy begins with the correct value of 2, then drops down to 1 when severe pageouts induce deceptive idleness.
None of the processes free any memory, and the system is thrashing heavily throughout the experiment. Soon after the depicted 60 seconds, the system runs out of swap space, becomes quite unresponsive, and kills the application. So this condition is a corner case: we shouldn't be in such extreme memory shortage at all.