The Dynamic Buffer Cache was designed to address this issue in a simple and straightforward way, and a first implementation yielded promising results.
However, with this architecture, the pager for the dynamic cache runs in the same task that maps the memory object - the OSF/1 server. This is a special case in EMMI, and a few additional policies have been added to handle the problems this situation raises and to allow further performance improvements.
The pool of virtually available cache buffers is linked as a freelist (the buf_table), which can also be accessed through an object offset index. Whenever a buffer is needed, the first free address in that list is allocated, and the buf_table entry for this offset is set to point to a dynamically allocated buffer handling structure (the original buf structure). This buffer is then given to the cache handling routines, which use their traditional hashing algorithms to retrieve its content. From then on, this buffer may be accessed either by the buffer cache, through its handling buf structure for data-related inquiries, or by the cache pager for VM-related update operations, through its offset index in the buf_table (Fig. 1).
Since data buffer size is generally a multiple of the page size, each buffer header has a bitmap that indicates which pages are currently valid.
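As an illustration of this layout, the sketch below models the buf_table as described: free slots are chained on a freelist, a slot is addressed by object offset, and an allocated slot points to a dynamically allocated buf carrying a valid-page bitmap. All names and sizes (buf_table_entry, NBUF_VIRT, b_valid_pages, ...) are hypothetical stand-ins rather than the actual OSF/1 server declarations.

    #include <stdlib.h>

    #define NBUF_VIRT      4096     /* slots in the virtual buffer pool (illustrative) */
    #define PAGE_SIZE      4096
    #define BUF_SIZE       8192     /* buffer size, a multiple of the page size */
    #define PAGES_PER_BUF  (BUF_SIZE / PAGE_SIZE)

    /* Hypothetical buffer handling structure, standing in for the original buf. */
    struct buf {
        unsigned long  b_offset;       /* offset of the buffer in the cache object */
        unsigned int   b_valid_pages;  /* bitmap of pages currently holding valid data */
        int            b_flags;        /* BUSY, DIRTY, ... */
    };

    /* One buf_table entry per possible buffer slot of the cache object: a free
     * slot is chained on the freelist, an allocated slot points to its buf. */
    struct buf_table_entry {
        struct buf             *bte_buf;        /* NULL while the slot is free */
        struct buf_table_entry *bte_next_free;  /* freelist chaining */
    };

    static struct buf_table_entry  buf_table[NBUF_VIRT];
    static struct buf_table_entry *buf_freelist;

    /* Chain every slot on the freelist at initialization. */
    void buf_table_init(void)
    {
        for (int i = NBUF_VIRT - 1; i >= 0; i--) {
            buf_table[i].bte_next_free = buf_freelist;
            buf_freelist = &buf_table[i];
        }
    }

    /* Allocate the first free slot: its object offset is derived from the slot
     * index, and a buf structure is allocated dynamically to describe it. */
    struct buf *buf_alloc(void)
    {
        struct buf_table_entry *bte = buf_freelist;

        if (bte == NULL)               /* virtual cache exhausted */
            return NULL;
        buf_freelist = bte->bte_next_free;

        struct buf *bp = calloc(1, sizeof(*bp));
        bp->b_offset = (unsigned long)(bte - buf_table) * BUF_SIZE;
        bte->bte_buf = bp;
        return bp;                     /* handed to the cache hashing routines */
    }

    /* Offset-indexed lookup, as used by the cache pager for VM-related updates. */
    struct buf *buf_lookup(unsigned long offset)
    {
        return buf_table[offset / BUF_SIZE].bte_buf;
    }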
New buffers will be allocated from this pool until the kernel indicates that available memory is low by returning unused pages to the cache pager. At this point allocation of new buffers is suspended, and the cache's traditional freelists are used to satisfy new buffer requests.
If all pages in a buffer are returned, the buf_table entry corresponding to its first page offset is freed and chained on the free list, and the corresponding buf structure is deallocated. Therefore the cache size shrinks when memory is needed for other purposes.
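Continuing the same hypothetical structures, the shrinking path might be modeled as follows: when the kernel returns a page, its valid bit is cleared, and once no valid page remains the slot goes back on the freelist and the buf structure is deallocated.

    /* Sketch only: called when the kernel returns an unused page to the pager. */
    void cache_page_returned(unsigned long page_offset)
    {
        struct buf_table_entry *bte = &buf_table[page_offset / BUF_SIZE];
        struct buf *bp = bte->bte_buf;

        if (bp == NULL)
            return;

        unsigned int page_bit = 1u << ((page_offset % BUF_SIZE) / PAGE_SIZE);
        bp->b_valid_pages &= ~page_bit;     /* this page is no longer cached */

        if (bp->b_valid_pages == 0) {       /* all pages returned: shrink */
            bte->bte_buf = NULL;
            bte->bte_next_free = buf_freelist;
            buf_freelist = bte;
            free(bp);
        }
    }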
The cache pager is implemented as a multithreaded server running in the same context as the OSF/1 server task. The number of servicing threads may change to cope with varying traffic intensity, but as a general rule there will always be at least two available threads to service incoming requests. More details on this will be given in a following section.
In the first case, there is no need for the kernel to contact the pager, since the entire page content will be provided by the overwrite operation. To avoid this interaction, a new kernel pageout policy has been created, called ``silent overwrite'', which tells the kernel not to request data from the managing pager when the page fault originates from a complete overwrite operation.
In the second case, there is also no need to contact the pager, but since this situation cannot be detected by the kernel as in the previous case, the pager needs to provide a hint to the kernel. Since it runs in the same context as the cache managing routines, the pager is aware of situations in which buffers will be accessed to be overwritten. The ``unavailable in advance'' feature allows the pager to hint to the kernel not to issue a data request when handling a page fault. This behavior is triggered when a memory_object_data_unavailable call is received for an unmapped page prior to a data fault. This call should be issued by the pager on behalf of the buffer cache handling routines whenever they expect an access to an unmapped buffer page whose previous content is irrelevant.
The third case is the only one where the pager needs to be contacted to restore the page content. In this case the pager contacts the data stream responsible for that page's content, requests the page through a standard read operation, and then supplies it back to the kernel.
The introduction of the ``silent overwrite'' and ``unavailable in advance'' features also greatly reduces the number of kernel-pager interactions.
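To make the pager side of these three cases concrete, the sketch below models only the decisions involved. The prototypes used for memory_object_data_unavailable and memory_object_data_supply are simplified stand-ins for the actual EMMI calls (which carry more arguments), and stream_read is a hypothetical helper standing for the standard read operation.

    #include <stdio.h>

    #define PAGE_SIZE 4096

    /* Simplified stand-ins for the EMMI calls named above. */
    static void memory_object_data_unavailable(unsigned long offset, unsigned long size);
    static void memory_object_data_supply(unsigned long offset, void *data, unsigned long size);
    /* Hypothetical helper standing for the standard file system read path. */
    static void stream_read(unsigned long offset, void *data, unsigned long size);

    /* Second case (``unavailable in advance''): the cache routines are about to
     * access an unmapped buffer page whose previous content is irrelevant, so the
     * pager hints to the kernel that the coming fault needs no data request. */
    void cache_will_overwrite(unsigned long page_offset)
    {
        memory_object_data_unavailable(page_offset, PAGE_SIZE);
    }

    /* Third case: the page content must be restored; the pager reads it back from
     * the data stream responsible for the page and supplies it to the kernel. */
    void cache_data_request(unsigned long page_offset)
    {
        static char page[PAGE_SIZE];

        stream_read(page_offset, page, PAGE_SIZE);
        memory_object_data_supply(page_offset, page, PAGE_SIZE);
    }

    static void memory_object_data_unavailable(unsigned long offset, unsigned long size)
    {
        printf("hint: no data request needed at offset %lu (%lu bytes)\n", offset, size);
    }

    static void memory_object_data_supply(unsigned long offset, void *data, unsigned long size)
    {
        (void)data;
        printf("supply: offset %lu (%lu bytes) handed back to the kernel\n", offset, size);
    }

    static void stream_read(unsigned long offset, void *data, unsigned long size)
    {
        (void)offset; (void)data; (void)size;   /* placeholder for the real read */
    }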
On the other hand, whenever a pageout request is issued on a buffer page, the corresponding buffer may be busy because other cache-related operations are being performed on it, and since the pager and the memory being accessed are in the same task, some deadlock situations may arise.
All this led us to the introduction of another EMMI upcall, called memory_object_discard_request. The purpose of this call is to provide a hint to the buffer pager on which pages have not been accessed recently and therefore should be discarded. Based on status information associated with the buffer, the pager decides whether the page may be discarded. If the page cannot be immediately discarded because the buffer is BUSY, no action is performed; the kernel keeps it mapped and will wait for the next pageout round to possibly discard it. Otherwise, if the page is dirty but not BUSY, it is saved and then discarded.
It is important to note that this implies that the page has to be kept mapped across the discard request upcall, because if the buffer is BUSY or DIRTY, the page will most probably be accessed for its content at its original address. Only when the pager gains exclusive access to the buffer managing structure is the page deallocated, by an explicit call to memory_object_lock_request.
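A minimal sketch of this decision is given below, assuming simple B_BUSY/B_DIRTY flags on a hypothetical buffer header and placeholder helpers for the write-back and for the page deallocation performed through memory_object_lock_request.

    #define B_BUSY   0x01
    #define B_DIRTY  0x02

    struct buf_hdr {                       /* hypothetical buffer header */
        int flags;
    };

    /* Placeholders: the first stands for the file system write-back path, the
     * second for the deallocation done via an explicit memory_object_lock_request. */
    static void buffer_write_back(struct buf_hdr *bp)      { (void)bp; }
    static void page_deallocate(unsigned long page_offset) { (void)page_offset; }

    /* memory_object_discard_request handling: the kernel hints that this page has
     * not been accessed recently; the pager decides whether to discard it. */
    void cache_discard_request(struct buf_hdr *bp, unsigned long page_offset)
    {
        if (bp->flags & B_BUSY)
            return;                        /* busy: no action, the kernel keeps the
                                              page mapped until the next pageout round */

        if (bp->flags & B_DIRTY) {         /* dirty but idle: save it first */
            buffer_write_back(bp);
            bp->flags &= ~B_DIRTY;
        }

        page_deallocate(page_offset);      /* clean and idle: discard the page */
    }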
In situations of intense pageout, the kernel still has the possibility of issuing mandatory pageout requests (memory_object_data_return). But since the pager will respond in this case with an immediate data_supply for the same page, the kernel will eventually redirect it to the default pager. In fact, in the current implementation we measured only a very small percentage of mandatory pageout requests, even during severe AIMIII tests. This means that the advisory pageout functionality is satisfactory for most pageout situations.
Dynamic buffer allocation is inserted in the cache allocation mechanisms through the creation of an additional freelist slot, called BQ_MAP, which is logically inserted between the BQ_AGE and BQ_LRU slots (Fig. 2). This means that, when no aged buffers exist (mainly at system initialization), a pool of dynamic buffers is initially allocated, which is immediately chained in the hashlists and in the LRU freelist. Once the content of these buffers starts to age following the normal cache aging algorithms, they start being chained in the BQ_AGE list, from which requests for new buffers will then be satisfied.
Dynamic allocation may theoretically proceed until the buf_table contains no more free elements, which means that the virtual cache is exhausted. In practice, this seldom happens, because dynamic allocation of more buffers starts competing with other requests for memory. This situation is detected by the cache throttling mechanism, which governs dynamic allocation. When this happens, from the cache allocation point of view, everything works as if the BQ_AGE slot had disappeared, and buffers start to be recycled and allocated using the standard cache mechanisms.
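The resulting allocation order can be summarized by the sketch below; the queue names are those of the text, while the getnewbuf_sketch routine, the freelist_pop and dynamic_buf_alloc helpers (declarations only), and the dyn_alloc_allowed flag are hypothetical.

    #include <stddef.h>

    /* Freelist slots, in the order an allocation request scans them; BQ_MAP sits
     * logically between BQ_AGE and BQ_LRU. */
    enum bqueue { BQ_AGE, BQ_MAP, BQ_LRU };

    struct buf_hdr;                                 /* opaque here */

    struct buf_hdr *freelist_pop(enum bqueue q);    /* pop the head of a freelist */
    struct buf_hdr *dynamic_buf_alloc(void);        /* grow the virtual buffer pool */

    int dyn_alloc_allowed = 1;                      /* cleared by the throttle */

    /* Recycle aged buffers first; otherwise grow the cache dynamically while the
     * throttle allows it; otherwise fall back to standard LRU recycling. */
    struct buf_hdr *getnewbuf_sketch(void)
    {
        struct buf_hdr *bp;

        if ((bp = freelist_pop(BQ_AGE)) != NULL)    /* aged buffers exist */
            return bp;

        if (dyn_alloc_allowed && (bp = dynamic_buf_alloc()) != NULL)
            return bp;                              /* BQ_MAP: dynamic growth */

        return freelist_pop(BQ_LRU);                /* standard recycling */
    }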
The throttling mechanism is implemented by incrementing a pageout count each time a discard request or a data return upcall is received by the cache pager, and zeroing it every second, in the context of a time-out thread. The pageout rate thus obtained is then compared to low and high water mark rates. When the high water mark is exceeded, dynamic allocation of more buffers is suspended, and the number of free pages in the system is memorized. Dynamic allocation only resumes when the paging rate decreases below the low water mark and the number of free pages has increased significantly. When resumed, allocation is allowed only up to a certain limit, which is a percentage of the total available memory (free pages).
Suspending dynamic allocation may be decided every second or at each pageout request, while resuming may only be decided every second. The values for low and high water mark rates were first set arbitrarily and then tuned experimentally for optimal performance, and are easily reconfigurable at compile time.
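A condensed sketch of this throttle is shown below, with illustrative watermark values, a FREE_PAGE_FACTOR standing for ``increased significantly'', and a hypothetical vm_free_page_count query (declaration only); the real thresholds are the compile-time tunables mentioned above.

    /* Illustrative tunables; the real values were tuned experimentally. */
    #define PAGEOUT_RATE_LOW    5          /* pageout upcalls per second */
    #define PAGEOUT_RATE_HIGH  20
    #define FREE_PAGE_FACTOR    2          /* ``significantly'' more free pages */

    static int pageout_count;              /* bumped on discard/data-return upcalls */
    static int dyn_alloc_suspended;
    static unsigned long free_pages_at_suspend;

    unsigned long vm_free_page_count(void);   /* hypothetical free page query */

    /* Called on each discard request or data return upcall: suspension may be
     * decided immediately when the high water mark is crossed. */
    void throttle_note_pageout(void)
    {
        if (++pageout_count > PAGEOUT_RATE_HIGH && !dyn_alloc_suspended) {
            dyn_alloc_suspended = 1;
            free_pages_at_suspend = vm_free_page_count();
        }
    }

    /* Called once per second from the time-out thread: compare last second's
     * rate to the water marks, then restart the count. Resuming is only
     * decided here. */
    void throttle_tick(void)
    {
        if (pageout_count > PAGEOUT_RATE_HIGH && !dyn_alloc_suspended) {
            dyn_alloc_suspended = 1;
            free_pages_at_suspend = vm_free_page_count();
        } else if (dyn_alloc_suspended &&
                   pageout_count < PAGEOUT_RATE_LOW &&
                   vm_free_page_count() > FREE_PAGE_FACTOR * free_pages_at_suspend) {
            dyn_alloc_suspended = 0;       /* resume, up to the memory limit */
        }
        pageout_count = 0;
    }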
This means that if the pager receives a request for a buffer page while its header is in a BUSY state because of some file system operation, and if the pager needs to gain exclusive access to the buffer to handle this request, a deadlock situation is created. This is even more critical in the OSF/1 Server UNIPROC configuration, where a single mutex is used for all server thread synchronization. In this case, no discrete locks exist, and setting the BUSY flag on a buffer header can only be done if the UNIPROC mutex is held.
To avoid this problem, all the pager operations were written so as to minimize the number of sections where the master mutex needs to be held.
This is typically the case when a buffer which spans several pages has some of its pages discarded and the other(s) modified. After a while the discarded pages are needed again, but the buffer cannot be read as a whole through the file system because the content of the other pages would be lost. In this case, the cache handling routines detect that the buffer has one or more missing pages by checking the page bits on the buffer header, and issue individual page read requests. Data is read into a temporary buffer and then provided in advance by the pager through a memory_object_data_supply call, thus avoiding a page fault on a subsequent BUSY buffer access, which would create a deadlock situation.
Checking the page bits on the buffer header is thus always performed whenever a buffer is looked up and found in the cache and therefore moved from an idle to a busy state.
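A sketch of that check follows, reusing the hypothetical bitmap layout and the stand-in stream_read and memory_object_data_supply helpers from the earlier sketches; the real code works against the actual buf header and EMMI interface.

    /* For each page of the buffer whose valid bit is clear, read only that page
     * into a temporary buffer and supply it in advance, so that no page fault
     * can later occur on the BUSY buffer. */
    void cache_fill_missing_pages(struct buf *bp)
    {
        for (int i = 0; i < PAGES_PER_BUF; i++) {
            if (bp->b_valid_pages & (1u << i))
                continue;                           /* page still cached */

            unsigned long page_offset = bp->b_offset + (unsigned long)i * PAGE_SIZE;
            static char tmp[PAGE_SIZE];

            stream_read(page_offset, tmp, PAGE_SIZE);
            memory_object_data_supply(page_offset, tmp, PAGE_SIZE);
            bp->b_valid_pages |= (1u << i);
        }
    }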
This benchmark consists of repeatedly copying an 8 Mbyte file from one place to another on the same file system partition, and averaging the elapsed time. This is the test where the most significant improvements were seen, mainly because the file to be copied may now be kept entirely in the buffer cache. As we can see in Fig. 3, the comparison between the two versions of the OSF/1 MK Server shows a performance improvement of almost 100% for the dynamic cache version. In comparison with HP-UX, there is still a performance degradation of 15% in this particular benchmark, which is probably due to system call overhead.
Results in Fig. 4 show a small performance increase for small numbers of users, and in Fig. 5 a significant performance improvement for large numbers of users, namely when the load is such that the pageout activity becomes very intense.
We believe that this is due to the fact that the cache is able to adapt its size to the minimum needed, therefore not competing with other system needs for memory, and that a typical static cache problem, known as double paging, is now avoided.
Double paging happens in the static cache system because, in this case, buffer pages are simply vm_allocated and therefore backed by the default pager. On memory shortage, they are paged out to disk even when they are untouched and stale and also backed by the file system. Whenever one of these buffers happens to be recycled and rewritten, its pages will be brought in from the paging space and then overwritten. In the dynamic cache case, unused and untouched pages are just discarded, which means that only one overwrite I/O operation is performed instead of three (one for pageout, one for pagein, and the overwrite).
However, this approach is simple and straightforward, and a more global solution should be explored, where caching of file data would be better integrated with demand loading of binary files: currently, when the dynamic cache is configured, the server comes up with two different pagers - the vnode pager and the cache pager. Typically, during the handling of a page fault for the execution of a binary file, a request is made to the vnode pager, which then requests data from the file system, which in turn interacts with the cache pager to store the data in a buffer. This data is then provided to the kernel, and therefore two different read-only copies of the same page are kept: one in the server's buffer cache and another in the kernel's page cache. We think that there is room for optimization here.