Resource containers and LRP

Department of Computer Science
Rice University

Introduction

This code implements Resource containers along with Lazy Receiver Processing (LRP). Resource containers [2,3] are an operating system abstraction for resource principals distinct from the notion of a protection domain. They enable fine-grained resource management in operating systems. LRP [2,4] is a network subsystem architecture that integrates network processing with resource management in an operating system. The resources spent in network processing are correctly accounted to the resource principal on whose behalf this processing is performed.

Both Resource containers and LRP were originally implemented in Digital UNIX by Gaurav Banga in the context of the ScalaServer project. This code was developed from the ground up for the FreeBSD operating system by the author in the context of a project on resource management in cluster-based servers [1] under the direction of Prof. Peter Druschel and Prof. Willy Zwaenepoel at the Department of Computer Science, Rice University.

This document briefly describes the overall design and layout of the code as well as how to compile, install and run it. It does not describe every detail of the code; it is intended only as a quick-start guide for a knowledgeable systems person. Implementation details have to be gathered by reading the code. This document further assumes that the reader is familiar with the publications [2], [3], [4] and [5].

Design

Resource containers replace processes in FreeBSD as the resource principals. Consumption of resources is charged to Resource containers rather than processes (FreeBSD does not support kernel threads). Similarly, scheduling of resources happens between Resource containers rather than processes.

Resource containers are arranged hierarchically in a tree-like data structure for the purpose of scheduling. Each node in this tree is a Resource container and the tree is headed by a special root Resource container. Internal nodes differ from leaf nodes in that each internal node runs a scheduler for multiplexing resources among its child Resource containers. Different internal nodes may use different scheduling algorithms. Processes are associated (or resource bound) only to leaf containers, and these containers serve only to pass the resources handed to them by their parent containers to the associated processes.
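The hierarchy described above can be modelled in a few lines of user-level Python. This is only an illustrative sketch and not the kernel data structure: the class name `Container`, its fields, and the string used to name a scheduler are all hypothetical.

```python
class Container:
    """Hypothetical user-level model of one node in the container tree."""

    def __init__(self, name, parent=None, scheduler=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.processes = []         # resource-bound processes (leaves only)
        self.scheduler = scheduler  # only meaningful on internal nodes
        if parent is not None:
            if parent.processes:
                raise ValueError("a container with bound processes must stay a leaf")
            parent.children.append(self)

    def is_leaf(self):
        return not self.children

    def bind(self, pid):
        """Resource bind a process; only leaf containers accept processes."""
        if not self.is_leaf():
            raise ValueError("processes may be resource bound only to leaf containers")
        self.processes.append(pid)


root = Container("root", scheduler="proportional-share")
rc1 = Container("RC1", parent=root, scheduler="lottery")
leaf = Container("leaf-a", parent=rc1)
leaf.bind(1234)
```

Each internal node carries its own scheduler, so `root` and `RC1` may use different algorithms, while `leaf-a` only passes resources on to process 1234.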

To facilitate the development and debugging of the code, Resource containers are implemented in a loadable kernel module. This module is designed to replace the process-centric resource management in FreeBSD with Resource containers once the module is loaded. Unloading the module restores the FreeBSD resource management.

Once the Resource container kernel module is loaded, a set of system calls (described in detail in [2]) becomes available to applications for the purpose of using Resource containers. A brief description of these system calls is provided later in this document. However, the code is designed such that existing applications do not have to be modified; that is, it is not necessary for an application to use the system calls provided with Resource containers.

The Resource container code starts by resource binding every application process in the system to a distinct leaf container that is made a child of the root container. Therefore, immediately upon loading the module, the scheduling algorithm at the root Resource container determines the manner in which resources are multiplexed between processes. Application processes can then use the Resource container system calls to add more complexity to the hierarchy. The Resource containers are kept transparent to existing applications by automatically managing the resource binding of new processes so as to maintain the usual UNIX semantics. A newly created process is resource bound to a new leaf container that becomes a sibling of the leaf container associated with the process's parent.
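The default binding policy can be sketched as follows. This is a simplified user-level model under stated assumptions: the class `Hierarchy`, the generated leaf names, and the `fork` method are hypothetical and only mirror the rule that a new process gets a fresh leaf that is a sibling of its parent's leaf.

```python
import itertools


class Hierarchy:
    """Hypothetical model of the bindings created at module load and on fork."""

    def __init__(self, pids):
        self._ids = itertools.count(1)
        self.parent_of = {}  # container name -> parent container name
        self.bound_to = {}   # pid -> leaf container name
        # At load time every process gets its own leaf under the root.
        for pid in pids:
            self.bound_to[pid] = self._new_leaf("root")

    def _new_leaf(self, parent):
        leaf = f"leaf{next(self._ids)}"
        self.parent_of[leaf] = parent
        return leaf

    def fork(self, parent_pid, child_pid):
        """Bind the child to a new leaf that is a sibling of the parent's leaf."""
        parent_leaf = self.bound_to[parent_pid]
        sibling = self._new_leaf(self.parent_of[parent_leaf])
        self.bound_to[child_pid] = sibling
        return sibling


h = Hierarchy([100, 101])
child_leaf = h.fork(100, 102)
print(child_leaf, "is a child of", h.parent_of[child_leaf])
```

Because the child's leaf shares a parent with the parent's leaf, an unmodified application that forks sees its processes compete with each other much as they would without Resource containers.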

In order to support event driven processes/threads that perform processing on behalf of several Resource containers, the code also supports the concept of a scheduler binding [3]. A process/thread can scheduler bind itself to multiple Resource containers. The process/thread then becomes eligible to receive resources allocated to any of these containers. However, processing performed by the process/thread is accounted towards the unique Resource container to which it is resource bound.
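The difference between the two kinds of binding can be illustrated with a small model. The class `Worker` and its method names are hypothetical, and whether the scheduler binding set implicitly includes the resource binding is not stated here, so the sketch leaves that to the caller.

```python
class Worker:
    """Hypothetical model of an event-driven process/thread."""

    def __init__(self, resource_binding):
        self.resource_binding = resource_binding  # exactly one container
        self.scheduler_bindings = set()           # any number of containers

    def scheduler_bind(self, container):
        self.scheduler_bindings.add(container)

    def resource_bind(self, container):
        self.resource_binding = container

    def eligible_for(self, container):
        """May this worker receive resources allocated to `container`?"""
        return container in self.scheduler_bindings

    def run(self, ticks, usage):
        """Charge `ticks` of work to the current resource binding only."""
        usage[self.resource_binding] = usage.get(self.resource_binding, 0) + ticks


usage = {}
w = Worker("A")
w.scheduler_bind("A")
w.scheduler_bind("B")
w.run(5, usage)       # charged to A
w.resource_bind("B")
w.run(2, usage)       # charged to B
print(usage)
```

The worker is eligible for resources from both A and B throughout, but each unit of work is charged to exactly one container.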

LRP is also implemented in a separate loadable kernel module and serves to correctly account for network processing. Like processes, network sockets are resource bound to Resource containers (a network socket is bound to the same leaf container as the process that created it). Early demultiplexing is performed on network packets to identify the network sockets to which they belong. The packets are then placed in socket-specific queues and further processing is performed according to the scheduling priority of the Resource container that is associated with the network socket. All network processing is actually performed by a special process that is started during LRP initialization and remains kernel resident for as long as the LRP module is loaded. When network processing is to be performed for some socket, this special process first resource binds itself to the socket's associated container and then performs network processing for that socket. Because of this resource binding, the resource usage is charged to the correct container.
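The receive path described above can be sketched as a toy model. Everything here is an assumption made for illustration: packets are plain dictionaries with hypothetical `dst_port` and `cost` keys, demultiplexing is by destination port only, and a larger number in `priority` means a higher scheduling priority.

```python
from collections import deque


class Sock:
    def __init__(self, port, container):
        self.port = port
        self.container = container  # resource binding of the socket
        self.queue = deque()        # socket-specific packet queue


def early_demux(packet, socks_by_port):
    """Queue the packet on its socket without doing protocol processing."""
    sock = socks_by_port.get(packet["dst_port"])
    if sock is None:
        return False
    sock.queue.append(packet)
    return True


def lrp_process_once(socks, priority, usage):
    """Process one packet for the socket whose container has top priority."""
    ready = [s for s in socks if s.queue]
    if not ready:
        return None
    sock = max(ready, key=lambda s: priority[s.container])
    packet = sock.queue.popleft()
    # The worker is resource bound to sock.container at this point, so the
    # cost of protocol processing is charged to that container.
    usage[sock.container] = usage.get(sock.container, 0) + packet["cost"]
    return sock.port


socks = [Sock(80, "A"), Sock(8080, "B")]
by_port = {s.port: s for s in socks}
early_demux({"dst_port": 80, "cost": 2}, by_port)
early_demux({"dst_port": 8080, "cost": 3}, by_port)
usage = {}
print(lrp_process_once(socks, {"A": 1, "B": 5}, usage), usage)
```

The packet for port 8080 is processed first even though it arrived second, because its container has the higher priority, and its cost is charged to container B.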

To facilitate debugging, the LRP module implements a complete network stack rather than requiring obtrusive changes to the default FreeBSD-4.0 network stack. This is described in more detail in the next section.

Network Stack in LRP Loadable Module

The philosophy behind the design of both the Resource container and the LRP modules was to minimize the changes to the FreeBSD-4.0 kernel and to implement most of the code in the modules themselves. However, since LRP alters the network stack, the LRP loadable module installs an additional network stack in the operating system rather than modifying the existing one. In other words, once the LRP module is loaded, the operating system runs with two network stacks. The original stack is used for regular networking services like telnet, rlogin etc. Bugs in the new stack (unless they cause memory violations) therefore do not disrupt regular system services. All it takes to fix these bugs is to unload the module, recompile the fixed code and load the module back.

The network stack implemented by the loadable module was derived from the original FreeBSD-4.0 stack itself. The trick to installing additional network stacks is to install them as a different protocol family. Application programs (such as web servers) choose the network stack they want to use by specifying the protocol family as the first argument to the socket() system call. In FreeBSD, the protocol family for the regular TCP/IP stack is 2 (specified with the macro PF_INET or AF_INET). For the alternate stack implemented in the loadable module, this protocol family is chosen to be 21. Therefore, in order to use a conventional web server (e.g., Apache) with this loadable module, the first argument of any socket() calls made in the server code should be changed to 21. (In the Apache-1.3.3 source, this required a change in the file src/main/http_main.c.)
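As an illustration of selecting the stack through the first argument of socket(), here is a minimal sketch using Python's socket module (a server written in C would make the equivalent change to its socket() call). The name `PF_LRP` and both helper functions are hypothetical; only the values 2 and 21 come from the text above. The fallback to the regular stack is an assumption added so the sketch also runs where the LRP module is not loaded.

```python
import socket

PF_LRP = 21  # hypothetical name; the text above only gives the number 21


def stack_family(use_lrp):
    """Return the protocol family to pass as the first socket() argument."""
    return PF_LRP if use_lrp else socket.AF_INET


def open_stream_socket(use_lrp=True):
    """Open a stream socket on the chosen stack, falling back to AF_INET."""
    try:
        return socket.socket(stack_family(use_lrp), socket.SOCK_STREAM)
    except OSError:
        # Family 21 is not usable here (for example, LRP module not loaded).
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM)


print(stack_family(True), int(stack_family(False)))
```

On a machine without the LRP module, family 21 refers to whatever the host operating system assigns to that number, so `open_stream_socket` should be read as a sketch only.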

Once the kernel operates with both stacks loaded, the next question is which stack should service an IP packet that is received at a network interface. The solution to this is of necessity ad hoc, as both network stacks are equally capable of handling all IP packets. The packet is first presented to the stack in the loadable module. This stack quickly decides whether it wants the packet; if not, the packet is sent up the original stack. Under this implementation, all packets containing a source IP address of the form 192.x.x.x and arriving on any of the fxp0-fxp4 interfaces are accepted by the stack in the loadable module. However, packets from source 192.168.2.75 and packets from all other sources are sent up the original stack (192.168.2.75 was the IP address of our development machine and we wanted to use the regular stack for this address for ftp'ing the compiled code over to the experimental machine). This part of the code is implemented in LRP/netsys.c and should be modified as necessary.
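The acceptance rule stated above can be written down as a small predicate. This is a user-level restatement of the rule and not the code in LRP/netsys.c; the function and constant names are hypothetical.

```python
import ipaddress

LRP_INTERFACES = {f"fxp{i}" for i in range(5)}      # fxp0 .. fxp4
LRP_SOURCES = ipaddress.ip_network("192.0.0.0/8")   # 192.x.x.x
DEV_MACHINE = ipaddress.ip_address("192.168.2.75")  # always uses the original stack


def lrp_stack_accepts(src, ifname):
    """Return True if the stack in the loadable module takes the packet."""
    addr = ipaddress.ip_address(src)
    return (
        ifname in LRP_INTERFACES
        and addr in LRP_SOURCES
        and addr != DEV_MACHINE
    )


for src, ifname in [("192.168.1.10", "fxp0"),
                    ("192.168.2.75", "fxp0"),
                    ("10.0.0.1", "fxp1")]:
    stack = "LRP" if lrp_stack_accepts(src, ifname) else "original"
    print(src, ifname, "->", stack)
```

Adapting the rule to another site would mean changing the three constants, which correspond to the values that the text says should be modified as necessary.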

Code layout

This section provides a brief description of most of the files and directories included in this release.

Resource container system calls

Handles to Resource containers are made available to user applications in their file descriptor space (same as for sockets). Therefore, all the FreeBSD system calls that can be used on descriptors for files and sockets can also be used on descriptors for Resource containers. However, most of these will return an error when applied to a Resource container descriptor, with errno set to EINVAL. In addition, a set of new system calls has been provided for Resource containers. Any application that uses these system calls must include the file RC/rc.h before compilation. We now briefly describe the set of system calls that can be used on Resource container descriptors. Unless otherwise mentioned, all calls return 0 on success and -1 on failure.

Compiling and Installing

Download the source from the URL given at the beginning of this document. Unzip and untar the tarball.

Running

Follow these steps to test the system.
  1. Reboot experimental machine : Reboot the experimental machine such that it runs the modified kernel from the Compiling and Installing section.

  2. Load kernel modules : On the experimental machine, go to the directory where the four files from the build directory were copied. Execute 'make load' in this directory with super-user privileges. This loads the Resource container and LRP kernel modules (installing the second network stack in the kernel). In addition, it also starts LRP_startup. If only Resource container support is desired, execute 'make RC' rather than 'make load'.

  3. Start applications : Start any applications. As mentioned earlier, networking applications can use LRP only if they have been compiled to use the network stack in the LRP loadable module.

To terminate the setup, execute the following step:

  1. Unload the modules by executing the command 'make unload'.

Running programs in test directory : Once the Resource container and LRP modules are loaded, the sample programs in the test directory can be run. They can be compiled by executing the 'make' command in the test directory. The executables should be copied over to the experimental machine. Here is an example execution:
  1. auxprog1 40 &
    Creates RC1, attaches it to the root container, sets a reservation of 40% on it and goes to sleep.
  2. prog1 10 &
    Makes the leaf container of prog1 a child of RC1 and gives it a reservation of 10%. As prog1 is the only active process on the CPU, the 'top' command will show 100% of the CPU resources allocated to prog1.
  3. prog1 20 &
    Attaches another prog1 container to RC1 but assigns a 20% reservation to it. The two active prog1 processes would get CPU resources in the proportion 1:2 now.
  4. prog2 40 &
    Assigns a reservation of 40% to the leaf container of prog2, which is a child of the root container. As a result, the root container will distribute resources between RC1 and the container of prog2 in the ratio 1:1 (that is, prog2 shall get 50% of the CPU resources). The 50% share of RC1 shall be divided in the ratio 1:2 between the two active prog1 processes. Therefore, these will get about 16.7% and 33.3% of the CPU resources.
  5. Note that it takes a while before the CPU utilization values reported by 'top' stabilize to the above values. This is because past CPU usage is decayed with a factor of 0.95 in FreeBSD, which means that it takes about 45 seconds to forget 90% of the past usage. To terminate the sample programs, all instances of prog1 should be killed before auxprog1 (see Caveats).
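The arithmetic behind the example run can be checked with a short calculation. This sketch assumes that each scheduler divides its share among its active children in proportion to their reservations, that the sleeping auxprog1 consumes no CPU, and that the 0.95 decay is applied once per second. The function `cpu_shares` and the names `prog1-a` and `prog1-b` are hypothetical.

```python
import math


def cpu_shares(children, fraction=1.0):
    """Split `fraction` among children in proportion to their reservations.

    children maps a name to (reservation, subtree), where subtree is None
    for a leaf or another mapping of the same shape for an internal node.
    """
    total = sum(res for res, _ in children.values())
    result = {}
    for name, (res, subtree) in children.items():
        part = fraction * res / total
        if subtree:
            result.update(cpu_shares(subtree, part))
        else:
            result[name] = part
    return result


# State of the hierarchy after step 4 of the example run.
tree = {
    "RC1": (40, {"prog1-a": (10, None), "prog1-b": (20, None)}),
    "prog2": (40, None),
}
for name, share in cpu_shares(tree).items():
    print(f"{name}: {share:.1%}")

# Number of decay steps n such that 0.95**n == 0.10.
steps = math.log(0.10) / math.log(0.95)
print(f"steps to forget 90% of past usage: {steps:.1f}")
```

Under these assumptions prog2 receives half of the CPU, the two prog1 processes receive one sixth and one third, and roughly 45 decay steps are needed before 90% of the old usage is forgotten.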

Caveats

Bibliography

  1. "Cluster Reserves: A Mechanism for Resource Management in Cluster-based Network Servers", Mohit Aron, Peter Druschel and Willy Zwaenepoel. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, Santa Clara, CA, June 2000.

  2. "Operating System Support for Server Applications", Gaurav Banga, PhD thesis, Computer Science, Rice University, May 1999.

  3. "Resource containers: A new facility for resource management in server systems", Gaurav Banga, Jeffrey C. Mogul and Peter Druschel. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, LA, February 1999.

  4. "Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems", Peter Druschel and Gaurav Banga. In Proceedings of the Second Symposium on Operating Systems Design and Implementation (OSDI), Seattle, WA, October 1996.

  5. "Lottery Scheduling: Flexible Proportional-Share Resource Management", Carl A. Waldspurger and William E. Weihl. In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, November 1994.