Berkeley DB notes
Internals
These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.
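As a concrete illustration of the metadata page, here is a minimal sketch that reads page zero of a btree database file and prints its persistent parameters. The field layout mirrors the BTMETA structure in db-1.85's btree.h; the field names, their order, and the 0x053162 magic constant are assumptions to verify against our variant's source.
<pre>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Leading fields of the on-disk btree metadata page (page 0);
 * layout assumed from db-1.85's btree.h. */
struct btmeta {
	uint32_t magic;		/* 0x053162 for a btree database (assumed) */
	uint32_t version;	/* on-disk format version */
	uint32_t psize;		/* page size used by every page in the file */
	uint32_t free;		/* head of the free-page list */
	uint32_t nrecs;		/* record count (recno databases only) */
	uint32_t flags;		/* persistent flags */
};

int
main(int argc, char **argv)
{
	struct btmeta m;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (read(fd, &m, sizeof(m)) != (ssize_t)sizeof(m))
		return 1;
	printf("magic 0x%" PRIx32 " version %" PRIu32 " psize %" PRIu32
	    " free %" PRIu32 " flags 0x%" PRIx32 "\n",
	    m.magic, m.version, m.psize, m.free, m.flags);
	close(fd);
	return 0;
}
</pre>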
mpool
Berkeley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (MPOOL_PINNED) and/or dirty (MPOOL_DIRTY). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.
mpool_get gets an existing page. It checks whether the requested page is already in the cache and not pinned. (It will abort if the page is already pinned, unless the caller passes the MPOOL_IGNOREPIN flag.) It reads the page in from disk if necessary, and sets MPOOL_PINNED unless the caller passes the MPOOL_IGNOREPIN flag. If the cache size is larger than the configured maximum, mpool_get tries to reuse unreferenced cache entries when reading in a previously uncached page. It does this by walking the LRU queue for unpinned pages (which might be dirty, in which case it flushes them to disk). If this fails to free up a cache entry, mpool_get will allocate a new entry anyway, growing the cache beyond the nominal maximum size.
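The following self-contained miniature illustrates the reuse policy just described. It is not the real mpool code; the struct and names (entry, reuse_victim, flush_page) are invented for illustration.
<pre>
#include <stddef.h>

#define PINNED 0x01
#define DIRTY  0x02

struct entry {
	struct entry *lru_next;	/* next entry; head = least recently used */
	unsigned      flags;	/* PINNED and/or DIRTY */
	unsigned long pgno;
	void         *page;
};

/* Hypothetical write-back helper; stands in for writing the page to disk. */
static void
flush_page(struct entry *e)
{
	e->flags &= ~DIRTY;
}

/*
 * Walk the LRU queue from the head looking for an unpinned entry to
 * reuse, flushing it first if it is dirty.  Returning NULL means every
 * entry is pinned, so the caller allocates a fresh entry and the cache
 * grows past its nominal maximum.
 */
struct entry *
reuse_victim(struct entry *lru_head)
{
	struct entry *e;

	for (e = lru_head; e != NULL; e = e->lru_next) {
		if (e->flags & PINNED)
			continue;		/* in use, cannot be reused */
		if (e->flags & DIRTY)
			flush_page(e);		/* write back before reuse */
		return e;			/* caller reads the new page into it */
	}
	return NULL;				/* all pinned: grow the cache */
}
</pre>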
mpool_put releases a pinned page, clearing MPOOL_PINNED. The caller can pass the MPOOL_DIRTY flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, mpool_put does not change the order of the LRU queue; only mpool_get does that, which means that the ordering of mpool_put calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent mpool_get fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.
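A hedged usage sketch of the pin/dirty cycle, using the 4.4BSD mpool(3) interface bundled with db-1.85/1.86 (mpool_open, mpool_get, mpool_put, mpool_sync, mpool_close). touch_page is a hypothetical helper, not part of the library; MPOOL_IGNOREPIN is specific to our variant and is not used here.
<pre>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <db.h>
#include <mpool.h>

/* Pin page `pgno` of `file`, modify it, mark it dirty, and flush. */
int
touch_page(const char *file, pgno_t pgno)
{
	MPOOL *mp;
	char *page;
	int fd, ret;

	if ((fd = open(file, O_RDWR)) < 0)
		return -1;
	/* 4096-byte pages; keep at most 128 of them cached. */
	if ((mp = mpool_open(NULL, fd, 4096, 128)) == NULL) {
		close(fd);
		return -1;
	}
	/* Pin the page; it stays MPOOL_PINNED until mpool_put(). */
	if ((page = mpool_get(mp, pgno, 0)) == NULL) {
		mpool_close(mp);
		close(fd);
		return -1;
	}
	page[0] ^= 1;				/* modify the page ... */
	mpool_put(mp, page, MPOOL_DIRTY);	/* ... and mark it dirty */

	/*
	 * The dirty page is not necessarily on disk yet: it is written
	 * back when it falls off the LRU queue during a later
	 * mpool_get(), or when we flush explicitly here.
	 */
	ret = mpool_sync(mp);
	mpool_close(mp);
	close(fd);
	return ret;
}
</pre>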
btree
The btree back end is actually a B+tree. Pages in the tree are either internal pages or leaf pages. There is also a free list. Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) with pointers to lower-level pages. Pages have sibling links to the pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can potentially touch multiple pages, especially if parent pages must split (possibly all the way up to the root), creating opportunities for corruption if some of the writes fail. Often, corruption will make some items inaccessible to sequential access, random access, or both. Rarely, loops will form in a sibling chain, causing infinite loops during dumps (at least until disk space is exhausted).
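A sketch of a leaf-chain walk that notices sibling-link loops of the kind described above. The page header fields (pgno, prevpg, nextpg, flags) follow db-1.85's PAGE structure in btree.h, but the exact layout and the P_INVALID sentinel are assumptions to verify against our variant; walk_chain is a hypothetical checker, and it reads pages directly rather than through mpool to keep the sketch short.
<pre>
#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define P_INVALID 0		/* "no sibling" page number (assumed) */

struct pghdr {			/* leading fields of a btree page (assumed) */
	uint32_t pgno, prevpg, nextpg;
	uint32_t flags;
};

/* Follow right-sibling links from `start`; return -1 on a detected loop. */
int
walk_chain(int fd, uint32_t psize, uint32_t start, uint32_t npages)
{
	unsigned char *seen;	/* visited bitmap, one byte per page */
	struct pghdr h;
	uint32_t pg = start;

	if ((seen = calloc(npages, 1)) == NULL)
		return -1;
	while (pg != P_INVALID && pg < npages) {
		if (seen[pg]) {
			fprintf(stderr, "loop in sibling chain at page %u\n", pg);
			free(seen);
			return -1;
		}
		seen[pg] = 1;
		if (pread(fd, &h, sizeof(h), (off_t)pg * psize) != sizeof(h))
			break;		/* short read: end of file or I/O error */
		pg = h.nextpg;		/* move to the right sibling */
	}
	free(seen);
	return 0;
}
</pre>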
Compatibility
Our bundled "db2" is most likely db-1.86 with patches. db-1.85 was the version included by most BSD derivatives. db-1.86 made backward-incompatible changes in the hash back end (but probably not in the btree back end, though we haven't confirmed that).
There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.