Open Software Foundation
This document describes some enhancements to the Virtual Memory subsystem of the OSF Mach kernel. They were adapted from the OSF/1 integrated kernel and improve the behavior of the system under heavy load.
The goal of this project is to improve the behavior of the system under heavy load by implementing mechanisms that regulate paging activity and therefore reduce thrashing.
This problem had already been addressed for the integrated kernel: clustered paging and task swapping were implemented in OSF/1 releases 1.1 and 1.2. The Research Institute decided to adapt this work to the microkernel VM subsystem.
To make the memory manager's life easier, a cluster will always refer to contiguous pages. The memory manager sets the maximum cluster size itself, during the initialization of the memory object. For more efficient I/O management, the kernel will also request or return clusters aligned on a cluster-size boundary (i.e., the memory manager will never need to split a request because it spans several clusters on disk). However, the kernel might request or return a cluster smaller than the object's cluster size (if some pages are already present or already absent).
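As a rough illustration of this alignment rule, consider the following minimal sketch in C; the constants and function names are ours, not the kernel's, and the page and cluster sizes are just example values.

    #include <stdint.h>

    #define PAGE_SIZE      4096
    #define CLUSTER_PAGES  4                        /* example object cluster size */
    #define CLUSTER_BYTES  (CLUSTER_PAGES * PAGE_SIZE)

    /* Round the faulting offset down to a cluster-size boundary, so the
     * memory manager never sees a request spanning two clusters on disk. */
    static uint64_t
    cluster_start(uint64_t fault_offset)
    {
        return fault_offset & ~((uint64_t)CLUSTER_BYTES - 1);
    }

    /* The cluster may be trimmed when it runs past the end of the object,
     * which is one of the cases where the kernel requests fewer pages than
     * the object's cluster size. */
    static uint64_t
    cluster_end(uint64_t fault_offset, uint64_t object_size)
    {
        uint64_t end = cluster_start(fault_offset) + CLUSTER_BYTES;
        return end < object_size ? end : object_size;
    }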
Since the kernel needs to know what the user task is about to do, a new Mach service was added to provide more information about the access strategy of memory regions. The user can explicitly inform the microkernel that it will access a given memory range sequentially, reverse-sequentially or randomly. For a sequentially accessed region, the kernel will try to pre-fetch some pages ahead of the faulting page. For a reverse-sequentially accessed region, it will pre-fetch behind it. For randomly accessed regions, it will not do any pre-fetching at all. The default (when the user does not provide any information for a region) is to pre-fetch one page ahead.
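In Mach-derived kernels this kind of hint is typically expressed through a vm_behavior_set() style call; the sketch below assumes that interface and is only illustrative, since the exact names and constants may differ in this release.

    #include <mach/mach.h>

    /* Declare a memory range as sequentially accessed so the kernel can
     * pre-fetch clusters ahead of the faulting page.  The hint is purely
     * advisory; without it the default one-page-ahead policy applies. */
    kern_return_t
    advise_sequential(vm_address_t addr, vm_size_t size)
    {
        return vm_behavior_set(mach_task_self(), addr, size,
                               VM_BEHAVIOR_SEQUENTIAL);
    }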
Measurements on a typical workload, without any explicit access strategy, showed that 55% of the pre-fetched pages are actually used. Pre-fetching therefore does not improve performance in the default case, but it can be a big help for applications that, for example, access large arrays sequentially.
In the current implementation, the LRU criterion applies only to the initial page to be sent back. A cluster is then built around this page and sent to the memory manager. It could be useful to refine this strategy to avoid accidentally paging out frequently used pages: the cluster could be limited to pages that have not been accessed for some time (to be defined), or it could be kept as is if the task has been swapped out (see below).
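A minimal sketch of this pageout clustering, including the suggested refinement of skipping recently referenced pages, is shown below; the data structures and helper names are purely illustrative, not the actual kernel code.

    struct page {
        int resident;       /* page is in memory                      */
        int dirty;          /* page must be written to backing store  */
        int referenced;     /* page was accessed recently             */
    };

    /* Build a pageout cluster around the page chosen by the LRU scan.
     * The cluster stays inside one aligned cluster-size window and only
     * grows over contiguous pages that are resident, dirty and not
     * recently referenced. */
    int
    build_pageout_cluster(struct page *obj, int npages, int target,
                          int cluster_size, int *first, int *last)
    {
        int base  = target - (target % cluster_size);
        int limit = base + cluster_size;
        int lo = target, hi = target;

        if (limit > npages)
            limit = npages;

        while (lo > base && obj[lo - 1].resident && obj[lo - 1].dirty &&
               !obj[lo - 1].referenced)
            lo--;
        while (hi + 1 < limit && obj[hi + 1].resident && obj[hi + 1].dirty &&
               !obj[hi + 1].referenced)
            hi++;

        *first = lo;
        *last  = hi;
        return hi - lo + 1;     /* pages actually sent to the memory manager */
    }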
The default cluster size is 4 pages, but it can be changed to any power of 2 between 1 and 8 (the current maximum handled by the kernel, although increasing it is just a matter of changing a constant). With the OSF/1 server, the cluster size for the default pager can be set using the ``clsize=xx'' option for the swap partitions in /etc/fstab.
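For example, an fstab entry of the following general shape would select 8-page clusters for a swap partition; the device name and the other fields are hypothetical and depend on the installation, only the clsize= option matters here.

    /dev/rz3b  swap  ufs  sw,clsize=8  0  0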
The default pager's disk management was entirely reworked and is now an adaptation of the OSF/1 vnode pager's disk management. The tables used to locate a page on disk now refer only to clusters, and a small per-cluster bitmap records which pages are in use and/or present on disk for a given cluster.
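A minimal sketch of this per-cluster bookkeeping might look as follows; the structure layout and names are ours, not the actual default pager code.

    #include <stdint.h>

    #define CLUSTER_PAGES_MAX  8            /* current kernel maximum */

    struct cluster_entry {
        uint32_t block;     /* first disk block of the cluster on the partition  */
        uint8_t  pagemap;   /* bit i set: page i of the cluster is on disk;
                               one byte is enough for the 8-page maximum         */
    };

    static int
    page_on_disk(const struct cluster_entry *c, int page_in_cluster)
    {
        return (c->pagemap >> page_in_cluster) & 1;
    }

    static void
    mark_page_on_disk(struct cluster_entry *c, int page_in_cluster)
    {
        c->pagemap |= (uint8_t)(1 << page_in_cluster);
    }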
The management unit for the backing store partitions is now the cluster. This can lead to some disk space waste for objects whose size in pages is not a multiple of the cluster size: with 4-page clusters, for example, a 5-page object occupies two clusters and leaves three pages of backing store unused.
Task swapping is triggered when the paging activity exceeds a given level. The kernel selects a task that is eligible for swap-out and suspends it. Once the task is suspended, its memory should rapidly reach the top of the LRU list and get paged out via the usual mechanism (taking advantage of cluster pageout).
A swapped-out task is swapped back in either when one of its threads is explicitly woken up, or when the paging activity is back down to a reasonable level. Starvation is avoided: if one or more tasks maintain a paging rate that prevents other tasks from being swapped back in, the task swapper will swap them out too. The first task to have been swapped out is the first to be swapped back in.
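The overall swapping policy can be summarized by the sketch below; the thresholds, helper functions and names are purely illustrative, not the actual kernel code.

    struct task;

    extern int          paging_rate(void);          /* current pageout activity   */
    extern struct task *pick_swap_candidate(void);  /* an eligible resident task  */
    extern struct task *oldest_swapped_task(void);  /* FIFO order for swap-in     */
    extern void         task_suspend_for_swap(struct task *t);
    extern void         task_resume_from_swap(struct task *t);

    #define SWAPOUT_THRESHOLD  100   /* paging rate above which tasks are swapped out */
    #define SWAPIN_THRESHOLD    20   /* paging rate below which tasks come back in    */

    void
    task_swapper_tick(void)
    {
        struct task *t;

        if (paging_rate() > SWAPOUT_THRESHOLD) {
            /* Paging too hard: suspend one eligible task.  Its pages will
             * age to the top of the LRU list and be written out in clusters
             * by the normal pageout path. */
            t = pick_swap_candidate();
            if (t != NULL)
                task_suspend_for_swap(t);
        } else if (paging_rate() < SWAPIN_THRESHOLD) {
            /* Activity is back to a reasonable level: bring the oldest
             * swapped-out task back in first, which prevents starvation. */
            t = oldest_swapped_task();
            if (t != NULL)
                task_resume_from_swap(t);
        }
    }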
Further improvements were made at the RI to smooth the swapping activity and to try to compute dynamically the optimal paging rate at which task swapping should be triggered. This is still fairly basic; finding the exact value might require very accurate instrumentation and possibly some tools closer to artificial intelligence than to operating systems technology...
Note that the OSF/1 1.3 system does not include recent performance improvements that were made in the MK6 system (collocation, short-circuited RPC, etc.). This explains why the OSF/1 MK results are still significantly below the OSF/1 IK results. With the MK6 improvements, both systems have similar performance under normal load. In any case, the VM enhancements we have described address only the behavior under heavy load, so only the general shape of the performance curves is of interest here.
UnixWare stopped running after 150 users because it ran out of processes; according to the documentation, it is not possible to increase this number further. Linux panicked at 400 users because of a swap-related structure corruption. The OSF/1 1.3 IK hung at 200 users.
The OSF/1 1.3 MK did not start thrashing until 500 users, at which point less than 1 megabyte of non-wired memory was left available for applications. This is a good indication of the robustness of the system.
We conclude from this graph that OSF/1 MK now shows a reasonable degradation with increasing load and is able to maintain a correct throughput even under extreme loads. The OSF/1 MK curve is still below the curves for other systems, but the MK6 enhancements will probably move it up close to the OSF/1 IK level.
We have not tried to investigate the poor results for UnixWare as we have little knowledge of its internals. The excellent performance of Linux is probably due to its tightly integrated kernel (even the OSF/1 integrated kernel has a modular memory management layer, although it is mostly short-circuited for performance in the normal case), and to its small size, leaving more memory available for applications.
5.0 Port to the HP-PA MK6 kernel
The VM improvements are scheduled to be integrated in the MK6 microkernel and ported to the HP-PA. This system will probably perform even better because it can take advantage of collocation to allow the default pager to short-circuit the message interfaces to the devices. The EMMI (External Memory Manager Interfaces) will not be able to use short-circuiting in the short term because they are based on one-way messages, and only RPC-based interfaces can currently be short-circuited. Some work is being done at the RI to change the EMMI interfaces into RPCs.