This section evaluates the above NWCS solution on several microbenchmarks and real workloads. All experiments are run with and without the NWCS module enabled; our underlying claim is that this transparent kernel-level solution can eliminate deceptive idleness. With practically no overhead, it can achieve significant performance improvement and adherence to QoS objectives in a variety of cases, and almost never any degradation.
Three sets of microbenchmarks serve to illuminate the workings of the NWCS solution in a range of circumstances. We look at variations in access patterns and thinktimes for seek reducing schedulers, followed by the behaviour of a seek-reducing proportional-share scheduler.
All seek reducing experiments below use the ASPTF scheduler, which performs shortest positioning-time first scheduling with a bounded delay of 1 second. This achieves performance to within 1% of SPTF, while preventing starvation. For proportional-share scheduling, we use Stride coupled with ASPTF, with a tighter fairness threshold of 100ms (smaller than the switching threshold of ASPTF).
We employ two metrics of application performance: the application-observed throughput (in MB/s), and the disk utilization. Disk utilization in a time interval is the fraction of time that the disk spends servicing requests. This includes head positioning time and data transfer time, and excludes idle time (i.e. utilization is not the fraction of time spent doing useful work).
This experiment demonstrates performance improvements due to NWCS, with and without workarounds like asynchronous filesystem prefetch. Two processes rapidly issue 64 KB read requests into separate large files, using either read or mmap. These accesses are either sequential, or for every alternate 64 KB chunk, or randomly positioned within their respective files (figure 7.1).
Read induces kernel-driven asynchronous prefetch for sequential access, and hence achieves almost full disk bandwidth. File layout forces it to occasionally skip a block, and being conservative, prefetching becomes imperfect. The NWCS solution improves this by 5%, by steadily fetching blocks for one file until ASPTF forces it to switch. On the other hand, page faults in the mmap-ed region do not issue any asynchronous prefetch. NWCS achieves four times better throughput than before, while causing the disk to remain idle for about 5% of the time. This trend will be reflected later: moderate improvements for read, large ones for mmap.
Accessing alternate chunks defeats the FreeBSD prefetch heuristic. Both read and mmap achieve only 5 MB/s; in fact, read can go much lower for smaller chunks (mmap cannot, because of synchronous prefetch). NWCS improves throughput to the maximum that can be achieved for alternate blocks, i.e. half the disk bandwidth. In the random access case, the minor (5% and 7%) improvements by NWCS is because each process is performing random access within its respective file, where some seek optimization is possible. These two lie in the gray area between sequential access (where kernel-level prefetch is usually possible) and unoptimizably random access.
This set of four microbenchmarks illustrates the impact of waiting for applications that take different amounts of time to issue the subsequent request. Two processes map separate, large files into memory, and access these pages sequentially (thus avoiding kernel prefetch). For every 64 KB, they pause for some amount of time.
For NWCS: when
, we see the familiar situation where throughput is
four times that of the original system. For larger values of
up to
8ms the effect of waiting becomes increasingly burdensome on throughput
and utilization, and this huge improvement steadily reduces. At about
8ms, the waiting time becomes comparable to request service time, and
the cost-benefit equation tips the other way. It starts approximating
the original system to an increasing degree, until for very large
thinktimes (i.e. on an almost idle system) it plays no role.
Most applications have very short thinktimes when busy, in the region of
200
s to 2ms. Hence NWCS is expected to fetch large benefits.
Many timeouts expire in this experiment, so it proves useful for judging our
sensitivity to the accuracy of the timer. We changed the granularity so as
to tick every 50
s, 200
s, 500
s and 1ms. There was less than
10% of difference in throughput between these trials. This is also
supported by a similar experiment on the Apache webserver, where the
difference was negligible.
The following microbenchmark demonstrates the functionality of a combination
of the proportional-share and the seek-reducing heuristics. As before, two
processes
and
sequentially access separate memory-mapped files
rapidly. The underlying stride scheduler is assigned proportions of 1:2 for
the two processes (with
getting the higher share). These proportions are
in terms of utilization, not throughput.
Consider figure 7.5. In the original case, the scheduler
multiplexes between requests from the two processes, and achieves 1:1
proportions with the low throughput of 5 MB/s. When we turn on the
proportional-share heuristic, it realizes that process
is falling
behind, and waits for it. Taking seeks into account (9ms seek time, 3ms data
transfer time for every request), it achieves 1:2 proportions by servicing 6
requests from
for every request from
. This can be calculated to
yield slightly more than twice the throughput of the original case, and
indeed, it does. It results in a 2% drop in total utilization, as is
reflected by both allocations proportionally decreasing.
The above only waits for process
; when the seek reducing heuristic is
also turned on, it realizes the optimization potential in waiting for both
processes. This then services multiple requests from each process, thereby
achieving 21 MB/s and dropping total utilization to 95%, while retaining
proportions of 1:2.
If the requests were random, the proportional-share heuristic would still
wait for process
, though the seek reducing one wouldn't wait for either.
This demonstrates how our combination heuristic works under all
circumstances, eliminating the need for a real benchmark.
Solving deceptive idleness can clearly bring about significant benefits on microbenchmarks, but what is its impact on real applications? To see this, we use two real applications, and two standard benchmarks that are expected to mimic real scenarios.
The Andrew Benchmark [HKM$^+$88] understands filesystem
behaviour on a typical fileserver-like workload. It consists of each
client performing five phases: (a) mkdir, which creates
directories, (b) cp, which copies a repository of 71 C source
files to each of these
directories, (c) stat, which
aggressively lists all directory contents, (d) scan, which reads
all these files using grep and wc, and (e) gcc,
which compiles and links them. We used
copies of the
repository, which exceeds the main-memory cache. We configured this
benchmark with two clients to simulate concurrent access on a
fileserver, with each client acting on a separate instance of the
repositories. This experiment uses the same ASPTF scheduler as before,
with and without NWCS enabled.
A breakup of the execution times for individual benchmark phases is presented in figure 7.6 (The graph depicts the gcc bars scaled down by a factor of 3). Consider the scan phase, which is the only one that issues streams of synchronous read requests. NWCS transparently brings about a drop in execution time for this phase by 54%. Both grep and wc on FreeBSD use read, not mmap, and thus would have the advantage of kernel prefetch. However, individual files are small, so this prefetch plays practically no role. Major seek optimization happens here due to the files being in the same directory, and thus closely positioned on disk. NWCS enables the scheduler to capitalize on these seek opportunities and halve the execution time.
Other disk-intensive phases improve by smaller amounts: 16% for mkdir with synchronous metadata writes, and 5% for cp and stat each (the latter actually gets cached in memory). The CPU-intensive gcc phase shows an execution time degradation of 1.7%, due to the i8254 timer and the NWCS heuristic execution overheads. This phase strongly dominates total execution time, so that the overall benchmark shows an improvement of 8.4%.
Performance with one client is the same with or without NWCS; indeed, when there is only one stream of synchronous requests, NWCS plays no role. Increasing the number of clients to 8 shows practically the same performance as two clients: the scan phase improves by 57% in this case. This confirms the applicability and scalability of NWCS to busy fileservers.
The Apache webserver [Apa] employs a multi-process architecture to service requests from clients. Requests that miss in the cache are serviced from disk by the respective process. This happens frequently for webservers with large working sets, to the point of being disk-bound. For files smaller than 4 MB, the webserver mmaps the file and writes it out to socket; for larger files, it reads the data into application buffers (this was done to prevent some swap-based DoS attack on IRIX systems). Many other webservers and ftp servers use similar mechanisms for file transfer.
We configure this webserver to always use either read or mmap, and present results separately for both. We ran the Apache webserver with 3 client machines and 16 client processes each; this should generate enough concurrency that is typical of webservers. We tried varying the number of clients, but it made little difference to performance. These clients rapidly issue random requests for distinct files. The files themselves are chosen from a real workload corresponding to the CS department webserver at Rice University. These files have an average size of 34 KB. The file distribution is such that 95% of these files occupy 238 MB, whereas 99% occupy 814 MB. The scheduler, as before, is ASPTF.
Figure 7.7 characterizes the observed throughputs and utilizations. As before, we observe a large 56% improvement in throughput for mmap, but only about 16% for read. Unlike in the Andrew Benchmark, all clients to apache generate random requests to the entire repository, so requests to an individual apache process do not possess any special locality across files. So seek reduction opportunities are mainly in terms of servicing each file fully before moving on to the next. Many files are too small for any optimization. The intermediate-sized files are candidates, but conservative filesystem prefetch does not occur. NWCS makes the 16% difference in this region. Prefetch occurs for reads on large files, but not for mmap. This accounts for the large difference in performance between the two methods of access. In the default configuration (with mmap or read depending on file size), apache performs closer to the read case, indicating how files larger than 4 MB dominate performance.
This experiment involves the last stage of a FreeBSD kernel build, starting from a cold filesystem cache. The GNU linker reads 385 object files from disk. 75% of these files are under 10 KB, whereas 96% are under 25 KB. After reading all their ELF headers, GnuLD performs up to 9 (usually about 6) small, non-sequential reads in each file, corresponding to each ELF section. These reads are separated by computation required for the linking process.
The experiment in figure 7.8 demonstrates the performance of one and two simultaneous instances of GnuLD on disjoint repositories. We use two schedulers this time, ASPTF and C-LOOK. With one synchronous request issuer process, both schedulers result in execution times of about 1.8 seconds each. We would expect this to double with two instances of GnuLD. However, deceptive idleness (and no prefetching) causes an execution time rise by a factor of 5.5 instead.
NWCS brings about a benefit of 68% in the ASPTF case, and causes performance to scale almost exactly as needed. The C-LOOK scheduler, on the other hand, always insists on unidirectional movement. Since the object files are accessed in arbitrary order, C-LOOK does not allow NWCS to optimize fully. Hence we achieve a performance improvement of only 48%, and an execution time 56% higher than the ASPTF case.
The TPC-B benchmark [Cou94], proposed by the Transaction Processing Council in 1994, exercises a database system on simple, update-intensive transactions. Though it is declared to be outdated, it serves to illustrate the impact of NWCS on a read-write workload. The essential idea is to have clients making random update queries into a large database. We use two clients in our experiments. Individual records in the database are expected to be at least 100 bytes large. Unfortunately, the database system (MySQL) has computational overheads that make it CPU-bound for record sizes of 100 bytes, or even 1 KB. We therefore use 64 KB records, and a database size that exceeds main memory size.
Figure 7.9 shows four experiments. The first one has all queries directed towards one database, so seek optimization opportunities are few. The second bar has requests targetted from the two clients to two separate database files. This leads to choices of seeking within and between the files. The third and fourth experiments are similar, but the update operation is replaced by just a select.
An update query reads the record first, and then issues a delayed write. The presence of enough delayed writes can give the scheduler more choices, and reduce the effect of deceptive idleness. This happens in the first experiment, where the net improvement is just 4%. When we have two databases that are physically separated on disk, the impact of NWCS is better seen, as in the 28% improvement. Finally, the NWCS advantage is best brought out without any delayed writes, i.e. when the update operation is reduced to just a select. We observe throughput improvements by 17% and 60% for the same and different databases respectively.