Open Software Foundation
This document describes some enhancements to the Virtual Memory subsystem of the OSF Mach kernel. They were adapted from the OSF/1 integrated kernel and improve the behavior of the system under heavy load.
The goal of this project is to improve the behavior of the system under heavy load by implementing mechanisms that regulate paging activity and therefore reduce thrashing.
This problem had already been addressed for the integrated kernel: clustered paging and task swapping were implemented in OSF/1 releases 1.1 and 1.2. The Research Institute decided to adapt this work to the microkernel VM subsystem.
To make the memory manager's life easier, a cluster will always refer to contiguous pages. The memory manager sets the maximum cluster size itself, during the initialization of the memory object. For more efficient I/O management, the kernel will also request or return clusters aligned on a cluster-size boundary (i.e., the memory manager will never need to split a request because it spans several clusters on disk). However, the kernel might request or return a cluster smaller than the object's cluster size (if some pages are already present or already absent).
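As a rough illustration of this alignment rule, consider the following minimal sketch in C; the constants and function names are ours, not the kernel's, and the page and cluster sizes are just example values.

    #include <stdint.h>

    #define PAGE_SIZE      4096
    #define CLUSTER_PAGES  4                        /* example object cluster size */
    #define CLUSTER_BYTES  (CLUSTER_PAGES * PAGE_SIZE)

    /* Round the faulting offset down to a cluster-size boundary, so the
     * memory manager never sees a request spanning two clusters on disk. */
    static uint64_t
    cluster_start(uint64_t fault_offset)
    {
        return fault_offset & ~((uint64_t)CLUSTER_BYTES - 1);
    }

    /* The cluster may be trimmed when it runs past the end of the object,
     * which is one of the cases where the kernel requests fewer pages than
     * the object's cluster size. */
    static uint64_t
    cluster_end(uint64_t fault_offset, uint64_t object_size)
    {
        uint64_t end = cluster_start(fault_offset) + CLUSTER_BYTES;
        return end < object_size ? end : object_size;
    }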
Since the kernel needs to know what the user task is about to do, a new Mach service was added to provide more information about the access strategy of memory regions. The user can explicitly inform the microkernel that it will access a given memory range sequentially, reverse-sequentially or randomly. For a sequentially accessed region, the kernel will try to pre-fetch some pages ahead of the faulting page. For a reverse-sequentially accessed region, it will pre-fetch behind it. For randomly accessed regions, it will not do any pre-fetching at all. The default (when the user does not provide any information for a region) is to pre-fetch one page ahead.
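In Mach-derived kernels this kind of hint is typically expressed through a vm_behavior_set() style call; the sketch below assumes that interface and is only illustrative, since the exact names and constants may differ in this release.

    #include <mach/mach.h>

    /* Declare a memory range as sequentially accessed so the kernel can
     * pre-fetch clusters ahead of the faulting page.  The hint is purely
     * advisory; without it the default one-page-ahead policy applies. */
    kern_return_t
    advise_sequential(vm_address_t addr, vm_size_t size)
    {
        return vm_behavior_set(mach_task_self(), addr, size,
                               VM_BEHAVIOR_SEQUENTIAL);
    }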
Measurements on a typical workload, without any explicit access strategy, showed that 55% of the pre-fetched pages are actually used. Pre-fetching therefore does not improve performance in the default case, but it can be a big help for applications that, for example, access large arrays sequentially.
In the current implementation, the LRU criterion applies only to the initial page to be sent back. A cluster is then built around this page and sent to the memory manager. It could be useful to refine this strategy to avoid accidentally paging out frequently used pages: the cluster could be limited to pages that have not been accessed for some time (to be defined), or it could be kept as is if the task has been swapped out (see below).
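A minimal sketch of this pageout clustering, including the suggested refinement of skipping recently referenced pages, is shown below; the data structures and helper names are purely illustrative, not the actual kernel code.

    struct page {
        int resident;       /* page is in memory                      */
        int dirty;          /* page must be written to backing store  */
        int referenced;     /* page was accessed recently             */
    };

    /* Build a pageout cluster around the page chosen by the LRU scan.
     * The cluster stays inside one aligned cluster-size window and only
     * grows over contiguous pages that are resident, dirty and not
     * recently referenced. */
    int
    build_pageout_cluster(struct page *obj, int npages, int target,
                          int cluster_size, int *first, int *last)
    {
        int base  = target - (target % cluster_size);
        int limit = base + cluster_size;
        int lo = target, hi = target;

        if (limit > npages)
            limit = npages;

        while (lo > base && obj[lo - 1].resident && obj[lo - 1].dirty &&
               !obj[lo - 1].referenced)
            lo--;
        while (hi + 1 < limit && obj[hi + 1].resident && obj[hi + 1].dirty &&
               !obj[hi + 1].referenced)
            hi++;

        *first = lo;
        *last  = hi;
        return hi - lo + 1;     /* pages actually sent to the memory manager */
    }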
The default cluster size is 4 pages, but it can be changed to any power of 2 between 1 and 8 (the current maximum handled by the kernel, although increasing it is just a matter of changing a constant). With the OSF/1 server, the cluster size for the default pager can be set using the ``clsize=xx'' option for the swap partitions in /etc/fstab.
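For example, an fstab entry of the following general shape would select 8-page clusters for a swap partition; the device name and the other fields are hypothetical and depend on the installation, only the clsize= option matters here.

    /dev/rz3b  swap  ufs  sw,clsize=8  0  0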
The default pager's disk management was entirely reworked and is now an adaptation of the OSF/1 vnode pager's disk management. The tables used to locate a page on disk now refer only to clusters, and a small per-cluster bitmap records which pages are in use and/or present on disk for a given cluster.
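A minimal sketch of this per-cluster bookkeeping might look as follows; the structure layout and names are ours, not the actual default pager code.

    #include <stdint.h>

    #define CLUSTER_PAGES_MAX  8            /* current kernel maximum */

    struct cluster_entry {
        uint32_t block;     /* first disk block of the cluster on the partition  */
        uint8_t  pagemap;   /* bit i set: page i of the cluster is on disk;
                               one byte is enough for the 8-page maximum         */
    };

    static int
    page_on_disk(const struct cluster_entry *c, int page_in_cluster)
    {
        return (c->pagemap >> page_in_cluster) & 1;
    }

    static void
    mark_page_on_disk(struct cluster_entry *c, int page_in_cluster)
    {
        c->pagemap |= (uint8_t)(1 << page_in_cluster);
    }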
The management unit for the backing store partitions is now the cluster. This can lead to some disk space waste for objects whose size in pages is not a multiple of the cluster size: with 4-page clusters, for example, a 5-page object occupies two clusters and leaves three pages of backing store unused.
Task swapping is triggered when the paging activity exceeds a given level. The kernel selects a task that is eligible for swap-out and suspends it. Once the task is suspended, its memory should rapidly reach the top of the LRU list and get paged out via the usual mechanism (taking advantage of cluster pageout).
A swapped-out task is swapped back in either when one of its threads is explicitly woken up, or when the paging activity is back down to a reasonable level. Starvation is avoided: if one or more tasks maintain a paging rate that prevents other tasks from being swapped back in, the task swapper will swap them out too. The first task to have been swapped out is the first to be swapped back in.
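The overall swapping policy can be summarized by the sketch below; the thresholds, helper functions and names are purely illustrative, not the actual kernel code.

    struct task;

    extern int          paging_rate(void);          /* current pageout activity   */
    extern struct task *pick_swap_candidate(void);  /* an eligible resident task  */
    extern struct task *oldest_swapped_task(void);  /* FIFO order for swap-in     */
    extern void         task_suspend_for_swap(struct task *t);
    extern void         task_resume_from_swap(struct task *t);

    #define SWAPOUT_THRESHOLD  100   /* paging rate above which tasks are swapped out */
    #define SWAPIN_THRESHOLD    20   /* paging rate below which tasks come back in    */

    void
    task_swapper_tick(void)
    {
        struct task *t;

        if (paging_rate() > SWAPOUT_THRESHOLD) {
            /* Paging too hard: suspend one eligible task.  Its pages will
             * age to the top of the LRU list and be written out in clusters
             * by the normal pageout path. */
            t = pick_swap_candidate();
            if (t != NULL)
                task_suspend_for_swap(t);
        } else if (paging_rate() < SWAPIN_THRESHOLD) {
            /* Activity is back to a reasonable level: bring the oldest
             * swapped-out task back in first, which prevents starvation. */
            t = oldest_swapped_task();
            if (t != NULL)
                task_resume_from_swap(t);
        }
    }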
Further improvements were made at the RI to smooth the swapping activity and to try to compute dynamically the optimal paging rate at which task swapping should be triggered. This is still fairly basic; finding the exact value might require very accurate instrumentation and possibly some tools closer to artificial intelligence than to operating systems technology...
Note that the OSF/1 1.3 system does not include recent performance improvements that were made in the MK6 system (collocation, short-circuited RPC, etc.). This explains why the OSF/1 MK results are still significantly below the OSF/1 IK results. With the MK6 improvements, both systems have similar performance under normal load. In any case, the VM enhancements we have described address only the behavior under heavy load, so only the general shape of the performance curves is of interest here.
UnixWare stopped running after 150 users because it ran out of processes; according to the documentation, it is not possible to increase this number further. Linux panicked at 400 users because of a swap-related structure corruption. The OSF/1 1.3 IK hung at 200 users.
The OSF/1 1.3 MK did not start thrashing until 500 users, at which point less than 1 megabyte of non-wired memory was left available for applications. This is a good indication of the robustness of the system.
We conclude from this graph that OSF/1 MK now shows a reasonable degradation with increasing load and is able to maintain a correct throughput even under extreme loads. The OSF/1 MK curve is still below the curves for other systems, but the MK6 enhancements will probably move it up close to the OSF/1 IK level.
We have not tried to investigate the poor results for UnixWare as we have little knowledge of its internals. The excellent performance of Linux is probably due to its tightly integrated kernel (even the OSF/1 integrated kernel has a modular memory management layer, although it is mostly short-circuited for performance in the normal case), and to its small size, leaving more memory available for applications.
5.0 Port to the HP-PA MK6 kernel
The VM improvements are scheduled to be integrated in the MK6 microkernel and ported to the HP-PA. This system will probably perform even better because it can take advantage of collocation to allow the default pager to short-circuit the message interfaces to the devices. The EMMI (External Memory Manager Interfaces) will not be able to use short-circuiting in the short term because they are based on one-way messages, and only RPC-based interfaces can currently be short-circuited. Some work is being done at the RI to change the EMMI interfaces into RPCs.