Berkeley DB notes
These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.
Berkeley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (MPOOL_PINNED) and/or dirty (MPOOL_DIRTY). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.
mpool_get fetches an existing page. It checks whether the requested page is already in the cache and not pinned; it aborts if the page is already pinned, unless the caller passes the MPOOL_IGNOREPIN flag. It reads the page in from disk if necessary, and sets MPOOL_PINNED unless the caller passes MPOOL_IGNOREPIN. If the cache size is larger than a configured maximum, mpool_get tries to reuse unreferenced cache entries when reading in a previously uncached page. It does this by walking the LRU queue looking for unpinned pages (which might be dirty, in which case it flushes them to disk first). If this fails to free up a cache entry, mpool_get allocates a new entry anyway, growing the cache beyond the nominal maximum size.
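The reuse logic can be sketched with a toy model. This is not the real db code: the names model_get/model_put, the flat array in LRU order, and MAXCACHE are inventions for illustration only.

```c
#include <assert.h>
#include <string.h>

#define PINNED 0x1
#define DIRTY  0x2
#define MAXCACHE 3              /* nominal maximum number of entries */

struct entry {
    int pgno;                   /* page number within the file */
    unsigned flags;             /* PINNED | DIRTY */
};

/* cache[0] is the LRU head (least recently fetched);
   cache[ncache - 1] is the tail (most recently fetched). */
static struct entry cache[16];
static int ncache;
static int nflushed;            /* counts modeled writes to disk */

static int find(int pgno) {
    for (int i = 0; i < ncache; i++)
        if (cache[i].pgno == pgno)
            return i;
    return -1;
}

/* Move slot i to the LRU tail, preserving the order of the rest. */
static void to_tail(int i) {
    struct entry e = cache[i];
    memmove(&cache[i], &cache[i + 1], (ncache - i - 1) * sizeof(e));
    cache[ncache - 1] = e;
}

/* Model of mpool_get: pin the page and move it to the LRU tail,
   reusing an unpinned entry (flushing it first if dirty) when the
   cache is at its nominal maximum. */
static struct entry *model_get(int pgno) {
    int i = find(pgno);
    if (i >= 0) {
        if (cache[i].flags & PINNED)
            return NULL;        /* the real code aborts here */
        cache[i].flags |= PINNED;
        to_tail(i);
        return &cache[ncache - 1];
    }
    if (ncache >= MAXCACHE) {
        for (i = 0; i < ncache; i++) {
            if (!(cache[i].flags & PINNED)) {
                if (cache[i].flags & DIRTY)
                    nflushed++; /* flush the old page before reuse */
                memmove(&cache[i], &cache[i + 1],
                        (ncache - i - 1) * sizeof(cache[0]));
                ncache--;
                break;
            }
        }
        /* If every entry was pinned, fall through and grow the
           cache beyond MAXCACHE, as mpool_get does. */
    }
    cache[ncache].pgno = pgno;
    cache[ncache].flags = PINNED;
    ncache++;
    return &cache[ncache - 1];
}

/* Model of mpool_put: unpin, optionally mark dirty, and leave the
   LRU order alone. */
static void model_put(struct entry *e, unsigned flags) {
    e->flags &= ~PINNED;
    e->flags |= flags & DIRTY;
}
```

In this model, fetching a fourth page after releasing three reclaims the LRU head, flushing it first if it was released dirty.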
mpool_put releases a pinned page, clearing MPOOL_PINNED. The caller can pass the MPOOL_DIRTY flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, mpool_put does not change the order of the LRU queue; only mpool_get does that. As a result, the order of mpool_put calls does not determine the order in which mpool flushes dirty pages to disk. If a subsequent mpool_get fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its writeback.
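That ordering rule can be demonstrated with a minimal model. The names lru_get/lru_put and the integer queue are illustrative stand-ins, not mpool's actual structures.

```c
#include <assert.h>
#include <string.h>

/* Minimal model of the LRU ordering rule: fetching a page moves it
   to the tail; releasing it does not touch the order. */

#define QMAX 8
static int q[QMAX];             /* q[0] = head = flushed first */
static int qlen;

static void lru_get(int pgno) {
    int i;
    for (i = 0; i < qlen && q[i] != pgno; i++)
        ;
    if (i < qlen)               /* already cached: unlink it */
        memmove(&q[i], &q[i + 1], (qlen - i - 1) * sizeof(int));
    else
        qlen++;
    q[qlen - 1] = pgno;         /* append at the tail in either case */
}

static void lru_put(int pgno) {
    (void)pgno;                 /* no effect on LRU order, as with mpool_put */
}
```

Releasing pages 2 then 1 leaves the flush order as 1 then 2 (the fetch order); re-fetching page 1 moves it behind page 2.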
The btree back end is actually a B+tree. Pages in the tree are either internal pages or leaf pages; there is also a free list. Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) with pointers to lower-level pages. Pages have sibling links to the pages to their left and right on the same level of the tree (the free list also uses sibling links). These sibling links help speed up some tree update operations and sequential traversal. An update that splits a page can touch multiple pages, especially if parent pages must also split (possibly up to the root), creating opportunities for corruption if some of the writes fail. Often, corruption results in some items being inaccessible to sequential access, random access, or both. Rarely, a loop will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.
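Because a sibling-chain loop makes naive traversal run forever, a dump or verification tool could guard against it with cycle detection. A sketch, assuming pages are identified by small integers and next[] is an in-memory copy of the right-sibling links (an assumption for illustration, not the on-disk btree format):

```c
#include <assert.h>

#define NOPAGE 0                /* 0 = end of the sibling chain */

/* Floyd's cycle detection over the right-sibling chain: advance one
   pointer by one page and another by two; if they ever meet, the
   chain contains a loop. */
static int chain_has_loop(const int *next, int start) {
    int slow = start, fast = start;
    while (fast != NOPAGE && next[fast] != NOPAGE) {
        slow = next[slow];          /* advance one page */
        fast = next[next[fast]];    /* advance two pages */
        if (slow == fast)
            return 1;               /* the cursors met: a loop exists */
    }
    return 0;                       /* reached end of chain: no loop */
}
```

This runs in constant space, so it could be applied to a chain of any length before (or during) a dump.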
Our bundled "db2" is most likely db-1.86 with patches. db-1.85 was the version included by most BSD derivatives. db-1.86 made backward-incompatible changes in the hash back end (but probably not in the btree back end, though we haven't confirmed that).
There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.