Soft timers
Department of Computer Science
Rice University
- Author: Mohit Aron
Email: aron@cs.rice.edu
- Code release version: 1.0
- License: This code release is NOT under the GPL or its
variants. We are licensing this code free of charge to individuals for
non-commercial personal use and to non-profit entities for
non-commercial use. Permission is also granted to commercial enterprises
to use this software for evaluation purposes only. Any other use
requires obtaining a license from the Rice University Office of
Technology Transfer (
techtran@rice.edu). This code is copyrighted software and can NOT
be redistributed. Please see the copyright
notice for more details regarding the license terms and copyrights.
- Downloading: The code can be downloaded from this
URL. To unpack the code, the following instructions can be
used on Unix machines:
gunzip -c Soft-timers-1.0.tgz | tar xvf -
- Supported platforms:
This code only supports the FreeBSD-2.2.6 operating system running on
the x86 hardware. The gcc-2.7.2.1 compiler was used for compiling the
code on this platform. The source code for the freely available
FreeBSD-2.2.6 operating system can be obtained from the FreeBSD CVS
repository. For more information, refer the FreeBSD Handbook available
from http://www.freebsd.org/.
- Contents:
Introduction
Design
Loadable Module Implementation
Code layout
Configuring
Compiling and Installing
Running
Caveats
Bibliography
This code implements Soft timers, a new operating system facility that allows
the efficient scheduling of software events at a granularity down to tens of
microseconds. Soft timers can be used to avoid interrupts and reduce context
switches associated with network processing without sacrificing low
communication delays. This code also implements two applications of Soft
timers, rate-based pacing of packet transmissions and network polling for
packet receptions.
This code was developed as part of the ScalaServer project
under the direction of Prof. Peter
Druschel at the Department of Computer Science, Rice University.
Along with Soft timers, this code also contains implementations of two hardware
interrupt based mechanisms for obtaining fine-grained events. The first uses
the external 8253 programmable interrupt timer (PIT) chip to generate
interrupts at the required frequency. The second uses the on-chip timers
available on APICs (asynchronous programmable interrupt controller), which only
appeared first in the Intel Pentium III family of the x86 chips.
This document briefly describes the overall design and layout of the code as
well as how to compile, install and run it. It does not comprehensively
describe all the code details and is only intended to be a guide to help give a
quick start to a knowledgeable Systems person. Implementation details have to
be gathered by reading the code. This document further assumes that the reader
is familiar with the papers given in the bibliography.
The code uses a timing-wheel based data structure for the scheduling of
software events. This data structure presents an interface to the rest of
the system for scheduling/canceling software events and to check and fire
any scheduled events.
For fine-grained specification of time, the time is maintained as a 64-bit
variable maintained in cycles of the processor's clock speed. Time measurement
is done using the 64-bit cycle counter available on modern x86 hardware. This
counter can be read with very little overhead (almost as low as reading a
register).
The principal challenge in efficiently supporting event scheduling at a
fine-granularity is to check the timing wheel data structure for pending events
at a high frequency and with low overhead. This code provides four different
methods of checking for pending events.
- 8253 PIT based periodic hardware interrupts: This method
programs the external 8253 chip to deliver interrupts at the required
frequency. While the 8253 can support high interrupt frequencies, the
overhead of each interrupt is prohibitively high. As explained in
[1,2], interrupts also
result in bad cache locality.
- APIC timer based periodic hardware interrupts: This method
programs the on-chip APIC to deliver interrupts at the desired
frequency. These interrupts are less expensive than the 8253 based
hardware interrupts [1]. However, as with 8253
interrupts, APIC interrupts also adversely affect cache locality. As APICs
only appeared first on the Pentium III, this method is not available on
processors prior to Pentium III in the x86 family.
- Soft timers: With Soft timers, any pending events are checked
from strategic places in the kernel code using procedure calls. In most
systems, these trigger states happen very frequently and
therefore Soft timers support fine granularity with low overhead (overhead
of procedure calls). As explained in [1,2], Soft timers also afford better cache locality than
hardware interrupt based mechanisms. However, as there are no guarantees
that trigger states shall be reached within any given amount of time, this
method cannot be used in situations where real-time constraints are to be
met.
- Soft timers backed by APIC timers: This approach combines the low
overhead of Soft timers with the ability of interrupt based mechanisms to
provide real-time guarantees [1]. This method
requires two timing parameters for scheduling an event - the time
T at which the event is supposed to fire, and a tolerance
tol that specifies the maximum tolerable delay for the firing of the
event. In other words, T+tol gives the deadline for the firing
of the event. For this purpose, an APIC interrupt is scheduled to occur at
time T+tol. If a Soft timer trigger state occurs between times
T and T+tol , the event is fired using that and the
APIC interrupt is canceled; otherwise the event is fired in the context
of the APIC interrupt when the deadline is reached. Scheduling and
canceling an APIC interrupt has low overhead (writing a register). If
trigger states happen at sufficiently high frequency, the APIC interrupt
would be canceled most of the time and the high overhead of actually
fielding an APIC interrupt won't be incurred. For implementing this
method, a second timing wheel is used to maintain the deadlines for
events.
Also included in the code are two applications in networking that need
fine-grained scheduling of events. These are rate-based pacing and
network polling.
Rate-based pacing is applied in the code to TCP connections such that
any two packets on a connection are separated by a minimum time interval.
This can be used to (1) prevent packet bursts in TCP, and (2) fully utilize
available bandwidth in high bandwidth-delay networks if the bandwidth is
known a priori . Techniques for measuring bandwidth (such as
packet-pair algorithm) are not implemented in this code.
Network polling is used to avoid interrupts generated when packets are received
on a network interface. These interrupts have high overhead and adversely
affect cache locality. With network polling, these interface interrupts are
turned completely off and periodically the network interface is checked (or
polled) to see if any packet has arrived. The code does a dynamic computation
of the poll interval in order to receive a target number of packets on the
interface at any poll. In order to poll the interface, the fine-grained event
support in the code is used.
In order to facilitate the development and debugging of the code, the kernel
parts of the code are implemented as a loadable module. Few code changes
are made to the actual FreeBSD-2.2.6 kernel - most of these changes are
necessary only to hook the loadable module into the kernel.
Since this project required changes to the network stack for implementing
rate-based pacing and network polling, the loadable module installs an
additional network stack in the operating system rather than modifying the
existing one. In other words, once the module is loaded, the operating system
runs with two network stacks. The original stack is used for regular
networking services like telnet, rlogin etc. Any bugs (unless they cause memory
violations) in the new stack thus still keep regular system services
running. All it takes to fix these bugs is to unload the module, recompile the
fixed code and load the module back.
The network stack implemented by the loadable module was derived from the
original FreeBSD-2.2.6 stack itself. The trick to installing additional network
stacks is to install them as a different protocol family. Application programs
(such as web servers) choose the network stack they want to use by specifying
the protocol family as the first argument to the socket() system call. In
FreeBSD, the protocol family for the regular TCP/IP stack is 2 (specified
with macro PF_INET or AF_INET). For the alternate stack implemented in the
loadable module, this protocol family is chosen to be 21. Therefore, in order
to use a conventional web server (e.g., Apache) with this loadable module,
the first argument of any socket() calls made in the server code should be
changed to 21. (In Apache-1.3.3 source this required a change in the file
src/main/http_main.c).
Once the kernel operates with both the stacks loaded, the next question that
arises is which stack should service an IP packet that is received at a network
interface. The solution to this is of necessity ad hoc, as both network
stacks are equally capable of handling all IP packets. The packet is first
presented to the stack in the loadable module. This stack quickly makes a
decision on whether it wants the packet or not - if not, then the packet is
sent up the original stack. Under this implementation, all packets containing a
source IP address of the form 192.x.x.x and arriving on any of the fxp0-fxp4
interfaces are accepted by the stack in the loadable module. However, packets
from source 192.168.2.75 and packets from all other sources are sent up the
original stack (192.168.2.75 was the IP address of our development machine and
we wanted to use the regular stack for this address for ftp'ing the compiled
code over to the experimental machine). This part of the code is implemented in
FreeBSD-stack/netinet/netsys.c and should be modified as necessary.
This section provides a brief description of most of the files and
directories included in this release.
- distdir/: After compilation, this
directory contains the files that should be copied over to the
experimental machine on which the code is to be run. These are a
Makefile and the loadable module object file netsys_mod.o
.
- FreeBSD-stack/: Directory that contains code for compiling
the loadable kernel module. This code implements the alternate
network stack that contains the rate-based pacing and network
polling applications. The support for fine-grained event scheduling
is compiled into this kernel module also.
- FreeBSD-stack/usr.bin/symorder: This directory contains the
modified source for the symorder program found in FreeBSD.
This program is used for localizing symbol definitions in the
loadable module so that they do not conflict with similar definitions
in the original network stack when the module is loaded. This enables
an alternate network stack to be derived from the original stack
without changing all the function names. The original FreeBSD
symorder does not localize global BSS symbols; the code in this
directory provides a fix to work around that. Executing a 'make'
in this directory would compile the symorder program. This should be
done before trying a 'make' in the FreeBSD-stack/compile directory.
- FreeBSD-stack/kpatch/sys/: This directory contains the set of
modified files in the FreeBSD-2.2.6 kernel code. Some of these files
are necessary for the loadable kernel module to interact with the
kernel, while others fix bugs and/or better optimize the kernel. If
the kernel source is present in /sys, then file
FreeBSD-stack/kpatch/sys/X/Y should replace file /sys/X/Y in the
kernel code. To quickly update the kernel source in /sys, the
command ' cp -r FreeBSD-stack/kpatch/sys /' can be used. The
files relative to this directory are described next (most of these
files are modified from corresponding files in the kernel source).
- i386/conf/PD0X: The kernel configuration file using which
we compiled our kernel. This should be modified as necessary to suit
a different hardware configuration than that on which this code was
tested.
- i386/i386/perfmon.c: Modified to fix a bug in the
performance monitoring device driver provided in FreeBSD-2.2.6.
- i386/i386/swtch.s, i386/i386/vm_machdep.c: Modified to
make the idle loop a Soft timer trigger state.
- i386/i386/trap.c: Modified to make traps and system calls
Soft timer trigger states.
- i386/include/vmparam.h: Modified to increase the max
memory available for kernel data structures from 32 MB to 64 MB. This
was required for some of the experiments presented in
[1,2].
- i386/isa/clock.c: Modified to support the use of 8253
chip to interrupt at high frequencies.
- i386/isa/icu.s: Modified to Make the assembly function
_doreti global. This function provides common code executed after
interrupt handling. This code was used by our APIC interrupt handler.
- kern/kern_malloc.c: Modified to instrument the kernel to
record the maximum memory every needed for internal kernel data
structures.
- net/if_ethersubr.c, net/if_loop.c: Modified to add hooks
in the kernel to enable the use of the alternate network stack in the
loadable module.
- net/if.h: Modified to increase the size of the IP
interface queue length from 50 to 1000 packets.
- netinet/tcp_subr.c: Modified to turn rfc1644 support
off. This is to disable T/TCP support that interferes in the
experimental setup used in
[1,2].
- nfs/nfs_node.c: Modified to fix a bug in the nfs code
(later fixed in FreeBSD-2.2.8).
- pci/if_fxp.c: Modified to support the polling code in
loadable kernel module.
- scsi/scsi_base.c: Modified to make disk interrupts a
Soft timer trigger state.
- sys/socket.h: Modified to increase the maximum length of
the listen queue from 128 to 1024 connections.
- sys/socketvar.h: Modified to increase the maximum
length of socket buffers from 256 KB to 512 KB.
- sys/soft_timers.h: Supports the system call implemented
by the loadable kernel module to get statistics about Soft timers.
- FreeBSD-stack/netinet/: This directory contains the source for
the loadable kernel module. Most of this source was derived from the
source for the network stack found in a regular FreeBSD-2.2.6 kernel
in the directory /sys/netinet. The loadable kernel module is
compiled by executing a 'make' in this directory. The files
in this directory are briefly described next.
- netsys.c, netsys.h : Contain (1) boiler-plate code
for the loadable kernel module, (2) code to initialize
a new protocol family in the kernel when module is loaded,
(3) the function sched_softintr() in netsys.c that inspects
every IP packet received on any network interface to decide
whether it is to be sent up the original network stack or
the stack in the loadable module, and (4) support for
enabling the different methods mentioned above that support
fine-grained event scheduling, and (5) support for enabling
rate-based pacing and network polling.
- soft_tw.c, soft_tw.h: Implement the timing wheel
for scheduling events. Also implement the second timing wheel
used when APIC timers are used in conjunction with Soft timers
to provide real-time deadlines.
- soft_poll.c : Implements network polling.
- tcp_output.c : Modified to implement rate-based pacing
of packet transmissions.
- apic.s : Implements the APIC interrupt handler.
FreeBSD-2.2.6 does not provide a default handler for APICs as
it does not support APICs by default.
- Most other files are copies of the corresponding files from the
kernel source in /sys/netinet. Most are unchanged or slightly
changed to either incorporate them in the alternate network
stack or to add Soft timer trigger states.
This section describes the various parameters used in the source for the
loadable kernel module in order to configure it.
- The macro tsc_freq in FreeBSD-stack/netinet/netsys.h specifies
the processor clock frequency in MHz. It should be changed in accordance
with the processor speed of the experimental machine being used.
- Configuring the 8253 PIT timer : The 8253 PIT timer can be
enabled/disabled by setting the variable enable_pit to 1/0 in the
file FreeBSD-stack/netinet/netsys.c. The interrupt frequency is specified
by setting the variable pit_freq in the same file to a value in
Hz (the default used in the source is 100000 Hz).
- Configuring Soft timers : Soft timers can be enabled/disabled
by setting the variable enable_soft_timer to 1/0 in the file
FreeBSD-stack/netinet/netsys.c.
- Configuring APIC timer : The APIC timer is configured by setting
the variable enable_apic_timer in
FreeBSD-stack/netinet/netsys.c. To disable the APIC timer, this variable
should be set to APIC_DISABLE. To configure the APIC timer to
periodically check the set of pending events at a certain frequency, the
enable_apic_timer variable should be set to APIC_ALONE
and the variable apic_tm_intvl should be set to the desired
interrupt time interval (in microseconds). The default value is configured
such that an interrupt is generated every 10 microseconds. To configure
the APIC timer to be used in conjunction with Soft timers to provide
real time deadlines, the value of enable_apic_timer should be
set to APIC_ST, and Soft timers should be enabled as given above.
- Configuring rate-based pacing : Rate-based pacing is enabled
by setting the variable min_pkt_sep in
FreeBSD-stack/netinet/netsys.c to a positive value. This value specifies
the minimum time interval between the transmission of two packets on a TCP
connection. The value is given in cycles of the processor clock. For
example, setting the variable to the value (1*tsc_freq) would
ensure a minimum of 1 microsecond spacing between any two packets sent
on a connection. Rate-based pacing is disabled by setting min_pkt_sep
to the value 0. When APIC timers are used in conjunction with
Soft timers to provide real-time deadlines, the variable tcp_tol
in FreeBSD-stack/netinet/soft_tw.c specifies the maximum tolerated delay
in scheduled packet transmissions.
- Configuring network polling : Network polling is enabled/disabled
by setting the variable enable_polling in
FreeBSD-stack/netinet/netsys.c to 1/0. This code does not implement a
parameter that specifies a deadline for polling events when APIC timers
are used with Soft timers. However, this can be very easily added by
modifying FreeBSD-stack/netinet/soft_poll.c.
Download the source from the URL given at the beginning of this document.
Unzip and untar the tarball.
Follow the following steps for testing the system.
- Reboot experimental machine : Reboot the experimental
machine such that it runs the modified kernel from the
Compiling and Installing section.
- Load kernel module : On the experimental machine, go to
directory where the two files from FreeBSD-stack/distdir were
copied. Execute 'make load' in this directory with super-user
privileges. This loads the kernel module and installs the second
stack in the kernel.
- Start the application : On the experimental machine,
start the user application that uses TCP/IP networking (e.g., a web
server). This application should have been compiled to use the
network stack in the loadable module. If a web server is used as the
application, it is now ready to receive HTTP requests from any clients.
For terminating the setup, execute the following steps:
- Stop the application running on the experimental machine.
- Unload the module by executing the command ' make unload '.
- The parameter enable_apic_timer should be set to
APIC_DISABLE when the code is used on processors that
do not contain an APIC.
- The code that uses the APIC timers is tailored to a 500 MHz Pentium
III. On other processors that support the APIC, the apic_bus_freq
variable in FreeBSD-stack/netinet/netsys.c has to be
changed appropriately. The value of this variable for a 500 MHz
Pentium III is 101 MHz.
- This is an alpha release. The code might be unstable under various
experimental conditions and might result in system crashes.
-
"Soft timers: efficient microsecond software timer support for
network processing", Mohit Aron and Peter Druschel.
In ACM Transactions on Computer Systems (TOCS) ,
August 2000.
-
"Soft timers: efficient microsecond software timer support for network
processing" , Mohit Aron and Peter Druschel. In Proceedings
of the 17th Symposium on Operating Systems Principles (SOSP-17) ,
Kiawah Island, SC, December 1999.