Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O Sitaram Iyer Peter Druschel For the latest on this project, please check: http://www.cs.rice.edu/~ssiyer/r/antsched/ Code release : antsched-1.0, 23 August 2001 Author : Sitaram Iyer Table of contents: 1. Intended audience 2. Quickstart guide 3. Macrobenchmarks 4. Code layout 5. License 6. References Please read the associated SOSP 2001 paper first[1]. In a tiny nutshell, anticipatory disk scheduling forms a framework around the conventional disk scheduler to deliberately inject small, calculated delays before making scheduling decisions. This gives applications the chance to issue their next synchronous request in time for the scheduler's decision. Results include significant performance improvements over a variety of commonplace disk-intensive applications, and closer adherence to contracts when using proportional-share disk schedulers. Presented here is a prototype implementation of anticipatory disk scheduling for FreeBSD-4.3-RELEASE. It is composed of a kernel module, and a small kernel patch containing the required hooks. For nearby FreeBSD versions such as 3.x and 5.x, we expect just the patch to require a few minor modifications for the whole system to work. INTENDED AUDIENCE 1. Those who're curious to understand certain specifics about anticipatory scheduling that's not detailed in the paper. 2. Those who'd like to try out anticipatory scheduling and perhaps perform a few experiments of their own choosing. 3. Those who plan to implement and deploy anticipatory scheduling in OSs such as Linux, FreeBSD, Windows, etc., and would like a guideline implementation. QUICKSTART GUIDE Here is an overview on how to get this up and running, and try out a couple of simple microbenchmarks. Hardware requirements: Intel-based system, with an IDE or SCSI disk; a couple of gigabytes of available storage on that disk. 1. (Download and) install FreeBSD-4.3-RELEASE. http://www.freebsd.org/ A small version discrepancy in either direction should be managable. 2. Login as yourself (not root). Download and unpack the tarball from http://www.cs.rice.edu/~ssiyer/r/antsched/antsched.tar.gz It'll create a directory called antsched. $ gzip -dc antsched.tar.gz | tar xf -; cd antsched 3. Create rosh (a setuid root shell): $ cd etc; make; cd .. # this builds a couple of other things too [Now login as root] # chown root etc/rosh; chmod 755 etc/rosh; chmod u+s etc/rosh; exit $ PATH=$PWD/etc:$PATH; export PATH $ rosh -c id uid=0(root) gid=0(root) ... 4. Make a full copy of /usr/src/sys here in antsched: cp -Rp /usr/src/sys . Apply the provided kernel patch: patch -sp0 < kernel.patch There might be a few rejects if the FreeBSD version is different; it shouldn't be too hard to fix them. Modify the configuration file i386/conf/GENKERNEL to suit your personal peeves. It is based on GENERIC, with changes commented as #ssiyer. There's also a "KERNEL" that's specialized for my system; you might find that more optimized in places. Avoid SMP. Rebuild the kernel: cd sys/i386/conf; config GENKERNEL; cd ../../compile/GENKERNEL; make; rosh -c 'make install'; rosh -c reboot After rebooting: $ cd wherever/antsched; PATH=$PWD/etc:$PATH; export PATH If your kernel is named something other than GENKERNEL, edit Makefile in this directory and change the KERNELNAME variable. 5. Edit the (highly nonstandard) config.h. You need to do this. There are detailed comments inside; highlights: * The current implementation only supports one disk at a time, as specified in the configuration. * There's a nice selection of scheduling policies to choose from, including some seek reducing and some proportional schedulers. 6. If experimental disk is SCSI, then turn off tagged queueing. As root, camcontrol devlist # to determine your device ID camcontrol negotiate -T disable camcontrol negotiate # to verify it's turned off 7. Calculate the positioning-time curve for your disk: $ cd postime; make; cd .. # compiles and runs it using rosh # takes about three minutes (you need gnuplot installed; if it's not, then mkpostime will fail. If this happens, you don't have to rerun postime again! (1) install gnuplot now and rerun mkpostime, or (2) copy mkpostime and postime.tmp to some machine with gnuplot and awk, run mkpostime there, and copy postime_data.h back). 8. Try out the antsched module: (just to make sure it doesn't blow apart) $ make This should automatically make depend, make, make load, wait for you to press enter, and finally unload. Expect to see messages like: gcc.. gcc.. ld.. open: no such file or directory. <> Loaded .../antsched/src/obj.i386/antsched_mod.ko, id=7 Id Refs Address Size Name 1 6 0xc0100000 2bfd20 kernel ... 7 1 0xc40f1000 6000 antsched_mod.ko Press enter to unload antsched_mod.. <> Unloading antsched_mod.ko, id=7 Hopefully you emerged out of that unscathed. 9. Set up stuff for experiments. Create /larder as the experimental directory. If / is on the experimental disk (with 2GB free space), then $ rosh -c 'mkdir /larder; chown yourusername /larder' else mkdir wherever/larder and cd /; ln -s wherever/larder larder. (if "/larder" is not feasible, then change bench.h and Makefile) mkdir /larder/file; cd /larder/file df -k . # Avail should say at least 2,000,000 dd if=/dev/zero of=file0 bs=1048576 count=1024 # 1GB file dd if=/dev/zero of=file1 bs=1048576 count=1024 # 1GB file # shrink these files if space is at a premium Perfect. You're ready for some microbenchmarks. cd wherever/antsched. 10. Impact of anticipatory scheduling on mmap-ed sequential file access. (section 4.1 in the paper) The antsched module is not loaded yet; try it first this way. # Assume using bash. One long command below, press to stop. # etc/ is in $PATH, remember. BENCH="+ " bench seqreadmap 0 & BENCH="* " bench seqreadmap 1 & (read); killall bench we observe: ... + 2.34 MB/s, avg 2.55 MB/s, 0.00 s, 18.06 MB : -0.00 MB * 2.43 MB/s, avg 2.54 MB/s, 0.00 s, 20.50 MB : -0.00 MB So total bandwidth = 2.43+2.34 = 4.77 MB/s. Throughput is low here, since asynchronous prefetching does not happen, and deceptive idleness makes the disk head alternate between requests from the two processes. Now run this with anticipatory scheduling. $ make load ... + 10.32 MB/s, avg 7.92 MB/s, 0.00 s, 45.69 MB : -0.00 MB * 12.28 MB/s, avg 6.40 MB/s, 0.00 s, 38.69 MB : -0.00 MB The numbers are irregular, since many requests are serviced from each process before switching to another. So after 25 seconds, the kernel prints a message to the process' terminal: -----! snitch !----- avg 20188 21318 948 Total application-observed throughput is 20188/1000 = 20.2 MB/s. Disk utilization = 94.8%. These represent the averages between 5 and 25 seconds into the experiment. The above numbers match with the seq/mmap bars in Figure 4. NOTE: your mileage may vary. These numbers depend on many factors, such as disk characteristics, relative and absolute location of the two files on disk (and the partition containing them), filesystem layout (whether the file blocks are physically sequential), etc. NOTE: the kernel module needs to be reloaded before every experiment that uses it; this can be done with "make reload". 11. Another experiment, for alternate-block access in Section 4.1. $ make reload; BENCH="+ " bench seqreadskip -x1 0 & BENCH="* " bench seqreadskip -x1 1 & (read); killall bench Run this with "#define NWCS" turned on and off in config.h, so that the kernel reports throughput and utilization in both cases. We observe: Antsched disabled: -----! snitch !----- avg 5049 4920 997 Antsched enabled: -----! snitch !----- avg 10682 10005 934 Throughputs of 5.049 and 10.683 MB/s agree with the pair of alter/read bars in Figure 4. 12. Proportional-share scheduling. (Microbenchmark version of Section 4.7) Enable STRIDE in config.h, and retain SPTF and ASPTF_BOUND at their current values. Keep NRC defined to 2, and PROP_RELAX_THRESH to 100. xrc 1 cmd args..: binds "cmd" to the 1st resource container. Default is container 0. These containers have allocations of 1:2, in terms of utilization (not throughput). So we use dmesg | tail -20 to see the kernel's utilization measurements, and check whether the ratio between achieved allocations of the two processes is indeed 2. First with #define NWCS commented out (i.e. antsched disabled): make re; BENCH="+ " bench seqreadmap 0 & BENCH="* " xrc 1 bench seqreadmap 1 & (read); killall bench dmesg | tail -20 .. a 33 2688 2496 506 b 66 2688 2432 492 total 5376 4928 997 2 .. Clearly, 506:492 != 1:2 by far. But with anticipatory scheduling enabled, we see .. a 33 4672 3712 301 b 66 9792 9152 665 total 14464 12864 966 14 .. Thus achieving desired proportions of 301:665, i.e. close to 1:2. In this case, *cumulative* proportions are almost exactly 1:2. Other microbenchmarks are equally easy to run; "bench" without arguments lists out a long set of suitable command-line options. MACROBENCHMARKS Various macrobenchmarks need some nontrivial setting up; please write to me at ssiyer@cs.rice.edu and I'll gladly mail you the details. This includes: * My TPC-B-variant implementation * Apache setup, with Berkeley and Rice webserver traces * Scripts to automate experiments, generate plots, etc. * Perl code that draw bargraphs directly in postscript. CODE LAYOUT * sys kernel source kernel.patch patch to the kernel source, provides scheduler hooks. In particular, search for ufs_disksubr to see how enqueue() and dequeue() are routed through the module. There are also small changes to the IDE and SCSI drivers. This patch is an overkill; there are ways to get away with far fewer changes. (e.g. all the IODONEs can be combined). * sched.c implements the anticipatory scheduling core (mainly in _hasany, dequeue and timeout_expire) and the kernel module stuff. NOTE: each process is associated with a structure, allocated at fork and destroyed at exit. Search for pinfo. * config.h is a bunch of #defines * _sptf.h,_cscan.h SPTF, Aged-SPTF and CSCAN schedulers seek.h wrapper around seek reducing schedulers seek_nwcs.h antsched heuristic for seek reducing schedulers prop.h implements Stride, YFQ and Lottery proportional schedulers prop_nwcs.h antsched heuristics for proportional and combined schedulers * apic.h,apic.s,pit.h individual timer mecanisms timer.h wrapper to implement a high-level timer interface * postime/* measures and estimates the positioning time curve LICENSE This code release is NOT under the GPL or its variants. We are licensing this code free of charge to individuals for non-commercial personal use, and to non-profit entities for non-commercial use. Permission is also granted to commercial enterprises to use this software for evaluation purposes only. Any other use requires obtaining a license from the Rice University Office of Technology Transfer (techtran@rice.edu). This code is copyrighted software and can NOT be redistributed. Please see the copyright notice for more details regarding the license terms and copyrights. (file: LICENSE) REFERENCES [1] Sitaram Iyer and Peter Druschel. Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O. In proceedings of the 18th Symposium on Operating Systems Principles, October 2001, Chateau Lake Louise, Banff, Canada. http://www.cs.rice.edu/~ssiyer/r/antsched/antsched.pdf [2] Sitaram Iyer and Peter Druschel. The effect of deceptive idleness on disk schedulers. Technical report CSTR01-379, Rice University, June 2001. http://cs-tr.cs.rice.edu/doc/TR01-379/TR01-379.pdf AND FINALLY Do drop us a note if you find any of this stuff (including concepts from the paper) useful in any interesting way. -- Written by Sitaram Iyer , 23 August 2001.