Berkeley DB notes
These notes mostly describe the btree back end for our special db-1.85/1.86 variant, as used for the "db2" KDB module. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.
Berkley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (
MPOOL_PINNED) and/or dirty (
MPOOL_DIRTY). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.
mpool_open opens a memory pool from an already-open file descriptor. It takes a
pagesize and a
pagesize is typically a persistent parameter of an upper layer (e.g., stored in the btree metadata page), while
maxcache can be tuned for performance without changing persistent database parameters.
mpool_get gets an existing page. It checks to see if the requested page is not pinned, and retrieves it from the cache if it's already in the cache. (If built with
-DDEBUG, it will abort if the page is already pinned, unless the caller passes the
MPOOL_IGNOREPIN flag). If the page is not in cache,
mpool_get reads in the page from disk. Regardless of how it obtained the page,
MPOOL_PINNED unless the caller passes the
MPOOL_IGNOREPIN flag. If the current cache size is at least
mpool_get tries to reuse unreferenced cache entries when reading in a previously uncached page (through static helper
mpool_bkt). It does this by getting the first unpinned page from the LRU queue (which might be dirty, in which case it will flush it to disk). If this fails to free up a cache entry,
mpool_bkt will allocate a new entry anyway, growing the cache beyond the
mpool_put releases a pinned page, clearing
MPOOL_PINNED. The caller can pass the
MPOOL_DIRTY flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably,
mpool_put does not change the order of the LRU queue; only
mpool_get does that, which means that the ordering of
mpool_put calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent
mpool_get fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.
mpool_sync writes dirty pages out to disk and calls
fsync. The only part of the btree back end that calls it is
__bt_sync (usually by way of
The btree back end is actually a B+tree. Pages in the tree are either internal pages or leaf pages. There is also a free list (with its head pointer in the metadata page). Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) paired with pointers to child pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch a potentially multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail (due to crashes or power failures). Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps. Sibling link corruption can lead to worse corruption if a page with a corrupted sibling link splits or is deleted.
Page splits of non-root pages always preserve the page being split as the left-hand page. Splitting the root creates two new pages and converts the root to an internal page if it started out as a leaf page.
Even though every page in a tree should have at least two children or items, some pages can end up with only one due to deletions. Only when a deletion would completely empty a page is that page deleted.
Our bundled "db2" is actually probably db-1.86 with patches. db-1.85 was the version included by most BSD derivatives. db-1.86 made backward-incompatible changes in the hash back end (but not probably not the btree back end, though we haven't confirmed that).
There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.
bt_conv.c might not work correctly for leaf records with small keys but big (overflow) data. This probably can't be caught by the current test suite, which can only check that the byteswapping code is internally consistent. (A proper test for this would need a database file created on an opposite-endian platform.)
Overflow page pointers for big keys (and sometimes big data) are unaligned; this can cause problems with byte swapping on platforms that enforce strict alignment.
The major open-source BSD derivatives NetBSD, FreeBSD, and OpenBSD have db-1.85/1.86 in their libc. They have shared various patches back and forth, but there are still some divergences among them. These can become relevant when trying to minimize diffs.
NetBSD seems to have converted from BSD fixed-width types (e.g.,
u_int32_t) to C99/POSIX fixed-width type (e.g.,
uint32_t), but FreeBSD and OpenBSD have not. FreeBSD and NetBSD seem to have converted function declarations to prototypes, but OpenBSD has not. NetBSD and OpenBSD have eliminated BSD type names such as
u_long, but FreeBSD has not.