next up previous contents
Next: 8 Related work Up: The Effect of Deceptive Previous: 6 Non-work-conserving scheduler solution   Contents

Subsections


7 Experimental evaluation

This section evaluates the above NWCS solution on several microbenchmarks and real workloads. All experiments are run with and without the NWCS module enabled; our underlying claim is that this transparent kernel-level solution can eliminate deceptive idleness. With practically no overhead, it can achieve significant performance improvement and adherence to QoS objectives in a variety of cases, and almost never any degradation.

1 Microbenchmarks

Three sets of microbenchmarks serve to illuminate the workings of the NWCS solution in a range of circumstances. We look at variations in access patterns and thinktimes for seek reducing schedulers, followed by the behaviour of a seek-reducing proportional-share scheduler.

All seek reducing experiments below use the ASPTF scheduler, which performs shortest positioning-time first scheduling with a bounded delay of 1 second. This achieves performance to within 1% of SPTF, while preventing starvation. For proportional-share scheduling, we use Stride coupled with ASPTF, with a tighter fairness threshold of 100ms (smaller than the switching threshold of ASPTF).

We employ two metrics of application performance: the application-observed throughput (in MB/s), and the disk utilization. Disk utilization in a time interval is the fraction of time that the disk spends servicing requests. This includes head positioning time and data transfer time, and excludes idle time (i.e. utilization is not the fraction of time spent doing useful work).

1 Different access patterns

This experiment demonstrates performance improvements due to NWCS, with and without workarounds like asynchronous filesystem prefetch. Two processes rapidly issue 64 KB read requests into separate large files, using either read or mmap. These accesses are either sequential, or for every alternate 64 KB chunk, or randomly positioned within their respective files (figure 7.1).

Figure 7.1: Different access patterns
\begin{figure}\centerline{
%%\psfig{figure=../x/micro/accpat/tput.eps,width=0.4...
... \psfig{figure=../../x/micro/accpat/util.eps,width=0.48\linewidth}} \end{figure}

Read induces kernel-driven asynchronous prefetch for sequential access, and hence achieves almost full disk bandwidth. File layout forces it to occasionally skip a block, and being conservative, prefetching becomes imperfect. The NWCS solution improves this by 5%, by steadily fetching blocks for one file until ASPTF forces it to switch. On the other hand, page faults in the mmap-ed region do not issue any asynchronous prefetch. NWCS achieves four times better throughput than before, while causing the disk to remain idle for about 5% of the time. This trend will be reflected later: moderate improvements for read, large ones for mmap.

Accessing alternate chunks defeats the FreeBSD prefetch heuristic. Both read and mmap achieve only 5 MB/s; in fact, read can go much lower for smaller chunks (mmap cannot, because of synchronous prefetch). NWCS improves throughput to the maximum that can be achieved for alternate blocks, i.e. half the disk bandwidth. In the random access case, the minor (5% and 7%) improvements by NWCS is because each process is performing random access within its respective file, where some seek optimization is possible. These two lie in the gray area between sequential access (where kernel-level prefetch is usually possible) and unoptimizably random access.

2 Varying thinktimes

This set of four microbenchmarks illustrates the impact of waiting for applications that take different amounts of time to issue the subsequent request. Two processes map separate, large files into memory, and access these pages sequentially (thus avoiding kernel prefetch). For every 64 KB, they pause for some amount of time.

A.  Symmetric processes

Consider figure 7.2, where time $ t$ on the x-axis represents the duration in milliseconds that both processes spend waiting. For values of $ t$ up to 8ms, the original system thrashes between requests from the two processes, fetching only 5 MB/s. At about 8ms, the waiting time becomes comparable to request service time, and utilization for the original system starts falling below 100%. Occasionally deceptive idleness gets avoided by servicing two successive requests for the same process. This fades away for larger values of $ t$.

Figure 7.2: Increasing thinktimes for both processes
\begin{figure}\centerline{
%%\psfig{figure=../x/micro/time/both-tput.eps,width=...
...e=../../x/micro/time/both-util.eps,width=0.40\linewidth,angle=270}} \end{figure}

For NWCS: when $ t = 0$, we see the familiar situation where throughput is four times that of the original system. For larger values of $ t$ up to 8ms the effect of waiting becomes increasingly burdensome on throughput and utilization, and this huge improvement steadily reduces. At about 8ms, the waiting time becomes comparable to request service time, and the cost-benefit equation tips the other way. It starts approximating the original system to an increasing degree, until for very large thinktimes (i.e. on an almost idle system) it plays no role. Most applications have very short thinktimes when busy, in the region of 200$ \mu$s to 2ms. Hence NWCS is expected to fetch large benefits.

B.  Asymmetric processes

Consider an alternative scenario in figure 7.3 where only one (slower) process waits for duration $ t$. The original system thrashes for $ t$ up to 12ms, but beyond that, two or more requests arrive for the quick request for every request from the slow one. This causes partial avoidance of deceptive idleness, due to which performance gradually improves for increasing $ t$. With NWCS: throughput first decreases due to increased waiting for both processes, till a point (4ms) where it starts waiting for the quick process but not for the slow process. Throughput quickly rises back to the maximum, with requests from the slow process serviced only when ASPTF induces a switch. Note that ASPTF only guarantees non-starvation, not fine-grained fairness.

Figure 7.3: Increasing thinktimes for one process
\begin{figure}\centerline{
%%\psfig{figure=../x/micro/time/one-tput.eps,width=0...
...re=../../x/micro/time/one-util.eps,width=0.40\linewidth,angle=270}} \end{figure}

C.  Random thinktimes

We wish to know how badly an application will perform if it defeated the thinktime heuristic. Interestingly, if a process waits for a random duration uniformly distributed between 0 and $ t$, it performs almost as well as the deterministic counterpart. This is because the expected median thinktime is judged to be roughly $ t/2$, and the expected 95%ile thinktime becomes $ t$.

D.  Adversary

So we wrote an intelligent adversary. Two processes wait for different durations as follows: they issue $ n$ rapid requests, then wait for a duration that just exceeds the timeout set by the kernel, and repeat. Graphs for varying $ n$ are depicted in figure 7.4. For $ n
= 0$, the NWCS solution can cope with all requests arriving slowly. But for $ n$ between 1 and 4, the NWCS heuristic performs slightly worse than the original system. This indicates that even in the most convoluted applications, degradation expected is minimal. Interestingly, an analogous situation arises in practice when applications issue very large read requests, and the kernel breaks them up into 128 KB chunks. We resolve that by having the filesystem inform the NWCS heuristic of this fact.

Figure 7.4: Adversary application
\begin{figure}\centerline{
%%\psfig{figure=../x/micro/adv/adv-tput.eps,width=0....
...ure=../../x/micro/adv/adv-util.eps,width=0.40\linewidth,angle=270}} \end{figure}

Many timeouts expire in this experiment, so it proves useful for judging our sensitivity to the accuracy of the timer. We changed the granularity so as to tick every 50$ \mu$s, 200$ \mu$s, 500$ \mu$s and 1ms. There was less than 10% of difference in throughput between these trials. This is also supported by a similar experiment on the Apache webserver, where the difference was negligible.

3 Proportional-share scheduling

The following microbenchmark demonstrates the functionality of a combination of the proportional-share and the seek-reducing heuristics. As before, two processes $ p$ and $ q$ sequentially access separate memory-mapped files rapidly. The underlying stride scheduler is assigned proportions of 1:2 for the two processes (with $ q$ getting the higher share). These proportions are in terms of utilization, not throughput.

Figure 7.5: Seek-reducing proportional-share scheduler
\begin{figure}\centerline{\psfig{figure=../../x/micro/prop/prop1.eps,width=0.5\linewidth,angle=270}} \end{figure}

Consider figure 7.5. In the original case, the scheduler multiplexes between requests from the two processes, and achieves 1:1 proportions with the low throughput of 5 MB/s. When we turn on the proportional-share heuristic, it realizes that process $ q$ is falling behind, and waits for it. Taking seeks into account (9ms seek time, 3ms data transfer time for every request), it achieves 1:2 proportions by servicing 6 requests from $ q$ for every request from $ p$. This can be calculated to yield slightly more than twice the throughput of the original case, and indeed, it does. It results in a 2% drop in total utilization, as is reflected by both allocations proportionally decreasing.

The above only waits for process $ q$; when the seek reducing heuristic is also turned on, it realizes the optimization potential in waiting for both processes. This then services multiple requests from each process, thereby achieving 21 MB/s and dropping total utilization to 95%, while retaining proportions of 1:2.

If the requests were random, the proportional-share heuristic would still wait for process $ q$, though the seek reducing one wouldn't wait for either. This demonstrates how our combination heuristic works under all circumstances, eliminating the need for a real benchmark.

2 Real workloads

Solving deceptive idleness can clearly bring about significant benefits on microbenchmarks, but what is its impact on real applications? To see this, we use two real applications, and two standard benchmarks that are expected to mimic real scenarios.

1 The Andrew Benchmark

The Andrew Benchmark [HKM$^+$88] understands filesystem behaviour on a typical fileserver-like workload. It consists of each client performing five phases: (a) mkdir, which creates $ n$ directories, (b) cp, which copies a repository of 71 C source files to each of these $ n$ directories, (c) stat, which aggressively lists all directory contents, (d) scan, which reads all these files using grep and wc, and (e) gcc, which compiles and links them. We used $ n = 500$ copies of the repository, which exceeds the main-memory cache. We configured this benchmark with two clients to simulate concurrent access on a fileserver, with each client acting on a separate instance of the $ n$ repositories. This experiment uses the same ASPTF scheduler as before, with and without NWCS enabled.

Figure 7.6: Andrew Benchmark
\begin{figure}\centerline{
%%\psfig{figure=../x/ab/abench.eps,width=0.6\linewid...
...re=../../x/ab/abench.eps,width=0.6\linewidth,height=0.4\linewidth}} \end{figure}

A breakup of the execution times for individual benchmark phases is presented in figure 7.6 (The graph depicts the gcc bars scaled down by a factor of 3). Consider the scan phase, which is the only one that issues streams of synchronous read requests. NWCS transparently brings about a drop in execution time for this phase by 54%. Both grep and wc on FreeBSD use read, not mmap, and thus would have the advantage of kernel prefetch. However, individual files are small, so this prefetch plays practically no role. Major seek optimization happens here due to the files being in the same directory, and thus closely positioned on disk. NWCS enables the scheduler to capitalize on these seek opportunities and halve the execution time.

Other disk-intensive phases improve by smaller amounts: 16% for mkdir with synchronous metadata writes, and 5% for cp and stat each (the latter actually gets cached in memory). The CPU-intensive gcc phase shows an execution time degradation of 1.7%, due to the i8254 timer and the NWCS heuristic execution overheads. This phase strongly dominates total execution time, so that the overall benchmark shows an improvement of 8.4%.

Performance with one client is the same with or without NWCS; indeed, when there is only one stream of synchronous requests, NWCS plays no role. Increasing the number of clients to 8 shows practically the same performance as two clients: the scan phase improves by 57% in this case. This confirms the applicability and scalability of NWCS to busy fileservers.

2 The Apache webserver

The Apache webserver [Apa] employs a multi-process architecture to service requests from clients. Requests that miss in the cache are serviced from disk by the respective process. This happens frequently for webservers with large working sets, to the point of being disk-bound. For files smaller than 4 MB, the webserver mmaps the file and writes it out to socket; for larger files, it reads the data into application buffers (this was done to prevent some swap-based DoS attack on IRIX systems). Many other webservers and ftp servers use similar mechanisms for file transfer.

We configure this webserver to always use either read or mmap, and present results separately for both. We ran the Apache webserver with 3 client machines and 16 client processes each; this should generate enough concurrency that is typical of webservers. We tried varying the number of clients, but it made little difference to performance. These clients rapidly issue random requests for distinct files. The files themselves are chosen from a real workload corresponding to the CS department webserver at Rice University. These files have an average size of 34 KB. The file distribution is such that 95% of these files occupy 238 MB, whereas 99% occupy 814 MB. The scheduler, as before, is ASPTF.

Figure 7.7: The Apache webserver
\begin{figure}\centerline{
%%\psfig{figure=../x/apache/apache-tput.eps,width=0....
...\psfig{figure=../../x/apache/apache-util.eps,width=0.40\linewidth}} \end{figure}

Figure 7.7 characterizes the observed throughputs and utilizations. As before, we observe a large 56% improvement in throughput for mmap, but only about 16% for read. Unlike in the Andrew Benchmark, all clients to apache generate random requests to the entire repository, so requests to an individual apache process do not possess any special locality across files. So seek reduction opportunities are mainly in terms of servicing each file fully before moving on to the next. Many files are too small for any optimization. The intermediate-sized files are candidates, but conservative filesystem prefetch does not occur. NWCS makes the 16% difference in this region. Prefetch occurs for reads on large files, but not for mmap. This accounts for the large difference in performance between the two methods of access. In the default configuration (with mmap or read depending on file size), apache performs closer to the read case, indicating how files larger than 4 MB dominate performance.

3 The GnuLD linker

This experiment involves the last stage of a FreeBSD kernel build, starting from a cold filesystem cache. The GNU linker reads 385 object files from disk. 75% of these files are under 10 KB, whereas 96% are under 25 KB. After reading all their ELF headers, GnuLD performs up to 9 (usually about 6) small, non-sequential reads in each file, corresponding to each ELF section. These reads are separated by computation required for the linking process.

Figure 7.8: The GNU Linker
\begin{figure}\centerline{
%%\psfig{figure=../x/gnuld/gnuld.eps,width=0.6\linewidth}}
\psfig{figure=../../x/gnuld/gnuld.eps,width=0.6\linewidth}} \end{figure}

The experiment in figure 7.8 demonstrates the performance of one and two simultaneous instances of GnuLD on disjoint repositories. We use two schedulers this time, ASPTF and C-LOOK. With one synchronous request issuer process, both schedulers result in execution times of about 1.8 seconds each. We would expect this to double with two instances of GnuLD. However, deceptive idleness (and no prefetching) causes an execution time rise by a factor of 5.5 instead.

NWCS brings about a benefit of 68% in the ASPTF case, and causes performance to scale almost exactly as needed. The C-LOOK scheduler, on the other hand, always insists on unidirectional movement. Since the object files are accessed in arbitrary order, C-LOOK does not allow NWCS to optimize fully. Hence we achieve a performance improvement of only 48%, and an execution time 56% higher than the ASPTF case.

4 The TPC-B database benchmark

The TPC-B benchmark [Cou94], proposed by the Transaction Processing Council in 1994, exercises a database system on simple, update-intensive transactions. Though it is declared to be outdated, it serves to illustrate the impact of NWCS on a read-write workload. The essential idea is to have clients making random update queries into a large database. We use two clients in our experiments. Individual records in the database are expected to be at least 100 bytes large. Unfortunately, the database system (MySQL) has computational overheads that make it CPU-bound for record sizes of 100 bytes, or even 1 KB. We therefore use 64 KB records, and a database size that exceeds main memory size.

Figure 7.9: The TPC-B database benchmark
\begin{figure}\centerline{\psfig{figure=../../x/tpcb/tpcb.eps,width=0.6\linewidth}} \end{figure}

Figure 7.9 shows four experiments. The first one has all queries directed towards one database, so seek optimization opportunities are few. The second bar has requests targetted from the two clients to two separate database files. This leads to choices of seeking within and between the files. The third and fourth experiments are similar, but the update operation is replaced by just a select.

An update query reads the record first, and then issues a delayed write. The presence of enough delayed writes can give the scheduler more choices, and reduce the effect of deceptive idleness. This happens in the first experiment, where the net improvement is just 4%. When we have two databases that are physically separated on disk, the impact of NWCS is better seen, as in the 28% improvement. Finally, the NWCS advantage is best brought out without any delayed writes, i.e. when the update operation is reduced to just a select. We observe throughput improvements by 17% and 60% for the same and different databases respectively.


next up previous contents
Next: 8 Related work Up: The Effect of Deceptive Previous: 6 Non-work-conserving scheduler solution   Contents
Sitaram Iyer ssiyer@cs.rice.edu