The Dynamic Buffer Cache was designed to address this issue in a simple and straightforward way, and a first implementation yielded promising results.
However, with this architecture, the pager for the dynamic cache runs in the same task that maps the memory object - the OSF/1 server. This is a special case in EMMI, and a few additional policies have been added to handle the problems this situation raises and to allow further performance improvements.
The pool of virtually available cache buffers is linked as a freelist (the buf_table), which can also be accessed through an object offset index. Whenever a buffer is needed, the first free address in that list is allocated, and the buf_table entry for this offset is set to point to a dynamically allocated buffer handling structure (the original buf structure). This buffer is then given to the cache handling routines, which use their traditional hashing algorithms to retrieve its content. From then on, this buffer may be accessed either by the buffer cache, through its handling buf structure for data-related inquiries, or by the cache pager for VM-related update operations, through its offset index in the buf_table (Fig. 1).
Since data buffer size is generally a multiple of the page size, each buffer header has a bitmap that indicates which pages are currently valid.
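As an illustration of this layout, the sketch below models the buf_table as described: free slots are chained on a freelist, a slot is addressed by object offset, and an allocated slot points to a dynamically allocated buf carrying a valid-page bitmap. All names and sizes (buf_table_entry, NBUF_VIRT, b_valid_pages, ...) are hypothetical stand-ins rather than the actual OSF/1 server declarations.

    #include <stdlib.h>

    #define NBUF_VIRT      4096     /* slots in the virtual buffer pool (illustrative) */
    #define PAGE_SIZE      4096
    #define BUF_SIZE       8192     /* buffer size, a multiple of the page size */
    #define PAGES_PER_BUF  (BUF_SIZE / PAGE_SIZE)

    /* Hypothetical buffer handling structure, standing in for the original buf. */
    struct buf {
        unsigned long  b_offset;       /* offset of the buffer in the cache object */
        unsigned int   b_valid_pages;  /* bitmap of pages currently holding valid data */
        int            b_flags;        /* BUSY, DIRTY, ... */
    };

    /* One buf_table entry per possible buffer slot of the cache object: a free
     * slot is chained on the freelist, an allocated slot points to its buf. */
    struct buf_table_entry {
        struct buf             *bte_buf;        /* NULL while the slot is free */
        struct buf_table_entry *bte_next_free;  /* freelist chaining */
    };

    static struct buf_table_entry  buf_table[NBUF_VIRT];
    static struct buf_table_entry *buf_freelist;

    /* Chain every slot on the freelist at initialization. */
    void buf_table_init(void)
    {
        for (int i = NBUF_VIRT - 1; i >= 0; i--) {
            buf_table[i].bte_next_free = buf_freelist;
            buf_freelist = &buf_table[i];
        }
    }

    /* Allocate the first free slot: its object offset is derived from the slot
     * index, and a buf structure is allocated dynamically to describe it. */
    struct buf *buf_alloc(void)
    {
        struct buf_table_entry *bte = buf_freelist;

        if (bte == NULL)               /* virtual cache exhausted */
            return NULL;
        buf_freelist = bte->bte_next_free;

        struct buf *bp = calloc(1, sizeof(*bp));
        bp->b_offset = (unsigned long)(bte - buf_table) * BUF_SIZE;
        bte->bte_buf = bp;
        return bp;                     /* handed to the cache hashing routines */
    }

    /* Offset-indexed lookup, as used by the cache pager for VM-related updates. */
    struct buf *buf_lookup(unsigned long offset)
    {
        return buf_table[offset / BUF_SIZE].bte_buf;
    }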
New buffers will be allocated from this pool until the kernel indicates that available memory is low by returning unused pages to the cache pager. At this point allocation of new buffers is suspended, and the cache's traditional freelists are used to satisfy new buffer requests.
If all pages in a buffer are returned, the buf_table entry corresponding to its first page offset is freed and chained on the free list, and the corresponding buf structure is deallocated. Therefore the cache size shrinks when memory is needed for other purposes.
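Continuing the same hypothetical structures, the shrinking path might be modeled as follows: when the kernel returns a page, its valid bit is cleared, and once no valid page remains the slot goes back on the freelist and the buf structure is deallocated.

    /* Sketch only: called when the kernel returns an unused page to the pager. */
    void cache_page_returned(unsigned long page_offset)
    {
        struct buf_table_entry *bte = &buf_table[page_offset / BUF_SIZE];
        struct buf *bp = bte->bte_buf;

        if (bp == NULL)
            return;

        unsigned int page_bit = 1u << ((page_offset % BUF_SIZE) / PAGE_SIZE);
        bp->b_valid_pages &= ~page_bit;     /* this page is no longer cached */

        if (bp->b_valid_pages == 0) {       /* all pages returned: shrink */
            bte->bte_buf = NULL;
            bte->bte_next_free = buf_freelist;
            buf_freelist = bte;
            free(bp);
        }
    }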
The cache pager is implemented as a multithreaded server running in the same context as the OSF/1 server task. The number of servicing threads may change to cope with varying traffic intensity, but as a general rule there will always be at least two available threads to service incoming requests. More details on this will be given in a following section.
In the first case, there is no need for the kernel to contact the pager, since the entire page content will be provided by the overwrite operation. To avoid this interaction, a new kernel pageout policy has been created, called ``silent overwrite'', which tells the kernel not to request data from the managing pager when the page fault originates from a complete overwrite operation.
In the second case, there is also no need to contact the pager, but since this situation cannot be detected by the kernel as in the previous case, the pager needs to provide a hint to the kernel. Since it runs in the same context as the cache managing routines, the pager is aware of situations in which buffers will be accessed to be overwritten. The ``unavailable in advance'' feature allows the pager to hint to the kernel not to issue a data request when handling a page fault. This behavior is triggered when a memory_object_data_unavailable call is received for an unmapped page prior to a data fault. This call should be issued by the pager on behalf of the buffer cache handling routines whenever they expect an access to an unmapped buffer page whose previous content is irrelevant.
The third case is the only one where the pager needs to be contacted to restore the page content. In this case the pager contacts the data stream responsible for that page's content, requests the page through a standard read operation, and then supplies it back to the kernel.
The introduction of the ``silent overwrite'' and ``unavailable in advance'' features also greatly reduces the number of kernel-pager interactions.
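To make the pager side of these three cases concrete, the sketch below models only the decisions involved. The prototypes used for memory_object_data_unavailable and memory_object_data_supply are simplified stand-ins for the actual EMMI calls (which carry more arguments), and stream_read is a hypothetical helper standing for the standard read operation.

    #include <stdio.h>

    #define PAGE_SIZE 4096

    /* Simplified stand-ins for the EMMI calls named above. */
    static void memory_object_data_unavailable(unsigned long offset, unsigned long size);
    static void memory_object_data_supply(unsigned long offset, void *data, unsigned long size);
    /* Hypothetical helper standing for the standard file system read path. */
    static void stream_read(unsigned long offset, void *data, unsigned long size);

    /* Second case (``unavailable in advance''): the cache routines are about to
     * access an unmapped buffer page whose previous content is irrelevant, so the
     * pager hints to the kernel that the coming fault needs no data request. */
    void cache_will_overwrite(unsigned long page_offset)
    {
        memory_object_data_unavailable(page_offset, PAGE_SIZE);
    }

    /* Third case: the page content must be restored; the pager reads it back from
     * the data stream responsible for the page and supplies it to the kernel. */
    void cache_data_request(unsigned long page_offset)
    {
        static char page[PAGE_SIZE];

        stream_read(page_offset, page, PAGE_SIZE);
        memory_object_data_supply(page_offset, page, PAGE_SIZE);
    }

    static void memory_object_data_unavailable(unsigned long offset, unsigned long size)
    {
        printf("hint: no data request needed at offset %lu (%lu bytes)\n", offset, size);
    }

    static void memory_object_data_supply(unsigned long offset, void *data, unsigned long size)
    {
        (void)data;
        printf("supply: offset %lu (%lu bytes) handed back to the kernel\n", offset, size);
    }

    static void stream_read(unsigned long offset, void *data, unsigned long size)
    {
        (void)offset; (void)data; (void)size;   /* placeholder for the real read */
    }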
On the other hand, whenever a pageout request is issued on a buffer page, the corresponding buffer may be busy because other cache-related operations are being performed on it, and since the pager and the memory being accessed are in the same task, some deadlock situations may arise.
All this led us to the introduction of another EMMI upcall, called memory_object_discard_request. The purpose of this call is to provide a hint to the buffer pager on which pages have not been accessed recently and therefore should be discarded. Based on status information associated with the buffer, the pager decides whether the page may be discarded. If the page cannot be immediately discarded because the buffer is BUSY, no action is performed; the kernel keeps it mapped and will wait for the next pageout round to possibly discard it. Otherwise, if the page is dirty but not BUSY, it is saved and then discarded.
It is important to note that this implies that the page has to be kept mapped across the discard request upcall, because if the buffer is BUSY or DIRTY, the page will most probably be accessed for its content at its original address. Only when the pager gains exclusive access to the buffer managing structure is the page deallocated, by an explicit call to memory_object_lock_request.
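A minimal sketch of this decision is given below, assuming simple B_BUSY/B_DIRTY flags on a hypothetical buffer header and placeholder helpers for the write-back and for the page deallocation performed through memory_object_lock_request.

    #define B_BUSY   0x01
    #define B_DIRTY  0x02

    struct buf_hdr {                       /* hypothetical buffer header */
        int flags;
    };

    /* Placeholders: the first stands for the file system write-back path, the
     * second for the deallocation done via an explicit memory_object_lock_request. */
    static void buffer_write_back(struct buf_hdr *bp)      { (void)bp; }
    static void page_deallocate(unsigned long page_offset) { (void)page_offset; }

    /* memory_object_discard_request handling: the kernel hints that this page has
     * not been accessed recently; the pager decides whether to discard it. */
    void cache_discard_request(struct buf_hdr *bp, unsigned long page_offset)
    {
        if (bp->flags & B_BUSY)
            return;                        /* busy: no action, the kernel keeps the
                                              page mapped until the next pageout round */

        if (bp->flags & B_DIRTY) {         /* dirty but idle: save it first */
            buffer_write_back(bp);
            bp->flags &= ~B_DIRTY;
        }

        page_deallocate(page_offset);      /* clean and idle: discard the page */
    }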
In situations of intense pageout, the kernel still has the possibility of issuing mandatory pageout requests (memory_object_data_return). But since the pager will respond in this case with an immediate data_supply for the same page, the kernel will eventually redirect it to the default pager. In fact, in the current implementation we measured only a very small percentage of mandatory pageout requests, even during severe AIMIII tests. This means that the advisory pageout functionality is satisfactory for most pageout situations.
Dynamic buffer allocation is inserted in the cache allocation mechanisms through the creation of an additional freelist slot, called BQ_MAP, which is logically inserted between the BQ_AGE and BQ_LRU slots (Fig. 2). This means that, when no aged buffers exist (mainly at system initialization), a pool of dynamic buffers is initially allocated, which is immediately chained in the hashlists and in the LRU freelist. Once the content of these buffers starts to age following the normal cache aging algorithms, they start being chained in the BQ_AGE list, from which requests for new buffers will then be satisfied.
Dynamic allocation may theoretically proceed until the buf_table contains no more free elements, which means that the virtual cache is exhausted. In practice, this seldom happens, because dynamic allocation of more buffers starts competing with other requests for memory. This situation is detected by the cache throttling mechanism, which governs dynamic allocation. When this happens, from the cache allocation point of view, everything works as if the BQ_AGE slot had disappeared, and buffers start to be recycled and allocated using the standard cache mechanisms.
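The resulting allocation order can be summarized by the sketch below; the queue names are those of the text, while the getnewbuf_sketch routine, the freelist_pop and dynamic_buf_alloc helpers (declarations only), and the dyn_alloc_allowed flag are hypothetical.

    #include <stddef.h>

    /* Freelist slots, in the order an allocation request scans them; BQ_MAP sits
     * logically between BQ_AGE and BQ_LRU. */
    enum bqueue { BQ_AGE, BQ_MAP, BQ_LRU };

    struct buf_hdr;                                 /* opaque here */

    struct buf_hdr *freelist_pop(enum bqueue q);    /* pop the head of a freelist */
    struct buf_hdr *dynamic_buf_alloc(void);        /* grow the virtual buffer pool */

    int dyn_alloc_allowed = 1;                      /* cleared by the throttle */

    /* Recycle aged buffers first; otherwise grow the cache dynamically while the
     * throttle allows it; otherwise fall back to standard LRU recycling. */
    struct buf_hdr *getnewbuf_sketch(void)
    {
        struct buf_hdr *bp;

        if ((bp = freelist_pop(BQ_AGE)) != NULL)    /* aged buffers exist */
            return bp;

        if (dyn_alloc_allowed && (bp = dynamic_buf_alloc()) != NULL)
            return bp;                              /* BQ_MAP: dynamic growth */

        return freelist_pop(BQ_LRU);                /* standard recycling */
    }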
The throttling mechanism is implemented by incrementing a pageout count each time a discard request or a data return upcall is received by the cache pager, and zeroing it every second, in the context of a time-out thread. The pageout rate thus obtained is then compared to low and high water mark rates. When the high water mark is exceeded, dynamic allocation of more buffers is suspended, and the number of free pages in the system is memorized. Dynamic allocation only resumes when the paging rate decreases below the low water mark and the number of free pages has increased significantly. When resumed, allocation is allowed only up to a certain limit, which is a percentage of the total available memory (free pages).
Suspending dynamic allocation may be decided every second or at each pageout request, while resuming may only be decided every second. The values for low and high water mark rates were first set arbitrarily and then tuned experimentally for optimal performance, and are easily reconfigurable at compile time.
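A condensed sketch of this throttle is shown below, with illustrative watermark values, a FREE_PAGE_FACTOR standing for ``increased significantly'', and a hypothetical vm_free_page_count query (declaration only); the real thresholds are the compile-time tunables mentioned above.

    /* Illustrative tunables; the real values were tuned experimentally. */
    #define PAGEOUT_RATE_LOW    5          /* pageout upcalls per second */
    #define PAGEOUT_RATE_HIGH  20
    #define FREE_PAGE_FACTOR    2          /* ``significantly'' more free pages */

    static int pageout_count;              /* bumped on discard/data-return upcalls */
    static int dyn_alloc_suspended;
    static unsigned long free_pages_at_suspend;

    unsigned long vm_free_page_count(void);   /* hypothetical free page query */

    /* Called on each discard request or data return upcall: suspension may be
     * decided immediately when the high water mark is crossed. */
    void throttle_note_pageout(void)
    {
        if (++pageout_count > PAGEOUT_RATE_HIGH && !dyn_alloc_suspended) {
            dyn_alloc_suspended = 1;
            free_pages_at_suspend = vm_free_page_count();
        }
    }

    /* Called once per second from the time-out thread: compare last second's
     * rate to the water marks, then restart the count. Resuming is only
     * decided here. */
    void throttle_tick(void)
    {
        if (pageout_count > PAGEOUT_RATE_HIGH && !dyn_alloc_suspended) {
            dyn_alloc_suspended = 1;
            free_pages_at_suspend = vm_free_page_count();
        } else if (dyn_alloc_suspended &&
                   pageout_count < PAGEOUT_RATE_LOW &&
                   vm_free_page_count() > FREE_PAGE_FACTOR * free_pages_at_suspend) {
            dyn_alloc_suspended = 0;       /* resume, up to the memory limit */
        }
        pageout_count = 0;
    }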
This means that if the pager receives a request for a buffer page while its header is in a BUSY state because of some file system operation, and if the pager needs to gain exclusive access to the buffer to handle this request, a deadlock situation is created. This is even more critical in the OSF/1 Server UNIPROC configuration, where a single mutex is used for all server thread synchronization. In this case, no discrete locks exist, and setting the BUSY flag on a buffer header can only be done if the UNIPROC mutex is held.
To avoid this problem, all the pager operations were written so as to minimize the number of sections where the master mutex needs to be held.
This is typically the case when a buffer which spans several pages has some of its pages discarded and the other(s) modified. After a while the discarded pages are needed again, but the buffer cannot be read as a whole through the file system because the content of the other pages would be lost. In this case, the cache handling routines detect that the buffer has one or more missing pages by checking the page bits on the buffer header, and issue individual page read requests. Data is read into a temporary buffer and then provided in advance by the pager through a memory_object_data_supply call, thus avoiding a page fault on a subsequent BUSY buffer access, which would create a deadlock situation.
Checking the page bits on the buffer header is thus always performed whenever a buffer is looked up and found in the cache and therefore moved from an idle to a busy state.
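A sketch of that check follows, reusing the hypothetical bitmap layout and the stand-in stream_read and memory_object_data_supply helpers from the earlier sketches; the real code works against the actual buf header and EMMI interface.

    /* For each page of the buffer whose valid bit is clear, read only that page
     * into a temporary buffer and supply it in advance, so that no page fault
     * can later occur on the BUSY buffer. */
    void cache_fill_missing_pages(struct buf *bp)
    {
        for (int i = 0; i < PAGES_PER_BUF; i++) {
            if (bp->b_valid_pages & (1u << i))
                continue;                           /* page still cached */

            unsigned long page_offset = bp->b_offset + (unsigned long)i * PAGE_SIZE;
            static char tmp[PAGE_SIZE];

            stream_read(page_offset, tmp, PAGE_SIZE);
            memory_object_data_supply(page_offset, tmp, PAGE_SIZE);
            bp->b_valid_pages |= (1u << i);
        }
    }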
This benchmark consists of repeatedly copying an 8 Mbyte file from one place to another on the same file system partition, and averaging the elapsed time. This is the test where the most significant improvements were seen, mainly because the file to be copied may now be kept entirely in the buffer cache. As we can see in Fig. 3, the comparison between the two versions of the OSF/1 MK Server shows a performance improvement of almost 100% for the dynamic cache version. In comparison with HP-UX, there is still a performance degradation of 15% in this particular benchmark, which is probably due to system call overhead.
Results in Fig. 4 show a small performance increase for small numbers of users, and in Fig. 5 a significant performance improvement for large numbers of users, namely when the load is such that the pageout activity becomes very intense.
We believe that this is due to the fact that the cache is able to adapt its size to the minimum needed, therefore not competing with other system needs for memory, and that a typical static cache problem, known as double paging, is now avoided.
Double paging happens in the static cache system because, in this case, buffer pages are simply vm_allocated and therefore backed by the default pager. On memory shortage, they are paged out to disk even when they are untouched and stale and also backed by the file system. Whenever one of these buffers happens to be recycled and rewritten, its pages will be brought in from the paging space and then overwritten. In the dynamic cache case, unused and untouched pages are just discarded, which means that only one overwrite I/O operation is performed instead of three (one for pageout, one for pagein, and the overwrite).
However, this approach is simple and straightforward, and a more global solution should be explored, where caching of file data would be better integrated with demand loading of binary files: currently, when the dynamic cache is configured, the server comes up with two different pagers - the vnode pager and the cache pager. Typically, during the handling of a page fault for the execution of a binary file, a request is made to the vnode pager, which then requests data from the file system, which in turn interacts with the cache pager to store the data in a buffer. This data is then provided to the kernel, and therefore two different read-only copies of the same page are kept: one in the server's buffer cache and another in the kernel's page cache. We think that there is room for optimization here.