logo_kerberos.gif

Difference between revisions of "Berkeley DB notes"

From K5Wiki
Jump to: navigation, search
(Created page with "==Internals== These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help ope...")
 
Line 1: Line 1:
 
==Internals==
 
==Internals==
   
These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.
+
These notes mostly describe the btree back end for our special db-1.85/1.86 variant, as used for the "db2" KDB module. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.
   
 
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.
 
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.
Line 9: Line 9:
 
Berkley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (<code>MPOOL_PINNED</code>) and/or dirty (<code>MPOOL_DIRTY</code>). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.
 
Berkley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (<code>MPOOL_PINNED</code>) and/or dirty (<code>MPOOL_DIRTY</code>). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.
   
<code>mpool_get</code> gets an existing page. It checks to see if the requested page is already in memory in a cache and not pinned. (It will abort if the page is already pinned, unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag). It reads in the page from disk if necessary, and sets <code>MPOOL_PINNED</code> unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag. If the cache size is larger than a configured maximum, <code>mpool_get</code> tries to reuse unreferenced cache entries when reading in a previously uncached page. It does this by walking the LRU queue for unpinned pages (which might be dirty, in which case it will flush them to disk). If this fails to free up a cache entry, <code>mpool_get</code> will allocate a new entry anyway, growing the cache beyond the nominal maximum size.
 
  +
<code>mpool_open</code> opens a memory pool from an already-open file descriptor. It takes a <code>pagesize</code> and a <code>maxcache</code> parameter. <code>pagesize</code> is typically a persistent parameter of an upper layer (e.g., stored in the btree metadata page), while <code>maxcache</code> can be tuned for performance without changing persistent database parameters.
  +
 
<code>mpool_get</code> gets an existing page. It checks to see if the requested page is not pinned, and retrieves it from the cache if it's already in the cache. (If built with <code>-DDEBUG</code>, it will abort if the page is already pinned, unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag). If the page is not in cache, <code>mpool_get</code> reads in the page from disk. Regardless of how it obtained the page, <code>mpool_get</code> sets <code>MPOOL_PINNED</code> unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag. If the current cache size is at least <code>maxcache</code>, <code>mpool_get</code> tries to reuse unreferenced cache entries when reading in a previously uncached page (through static helper <code>mpool_bkt</code>). It does this by getting the first unpinned page from the LRU queue (which might be dirty, in which case it will flush it to disk). If this fails to free up a cache entry, <code>mpool_bkt</code> will allocate a new entry anyway, growing the cache beyond the <code>maxcache</code>.
   
 
<code>mpool_put</code> releases a pinned page, clearing <code>MPOOL_PINNED</code>. The caller can pass the <code>MPOOL_DIRTY</code> flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, <code>mpool_put</code> does not change the order of the LRU queue; only <code>mpool_get</code> does that, which means that the ordering of <code>mpool_put</code> calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent <code>mpool_get</code> fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.
 
<code>mpool_put</code> releases a pinned page, clearing <code>MPOOL_PINNED</code>. The caller can pass the <code>MPOOL_DIRTY</code> flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, <code>mpool_put</code> does not change the order of the LRU queue; only <code>mpool_get</code> does that, which means that the ordering of <code>mpool_put</code> calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent <code>mpool_get</code> fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.
  +
  +
<code>mpool_sync</code> writes dirty pages out to disk and calls <code>fsync</code>. The only part of the btree back end that calls it is <code>__bt_sync</code> (usually by way of <code>__bt_close</code>).
   
 
===btree===
 
===btree===
   
The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list. Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) with pointers to lower level pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch a potentially multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail. Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.
+
The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list (with its head pointer in the metadata page). Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) paired with pointers to child pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch a potentially multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail (due to crashes or power failures). Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.
   
 
==Compatibility==
 
==Compatibility==

Revision as of 11:51, 13 August 2016

Internals

These notes mostly describe the btree back end for our special db-1.85/1.86 variant, as used for the "db2" KDB module. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.

Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.

mpool

Berkley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (MPOOL_PINNED) and/or dirty (MPOOL_DIRTY). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.

mpool_open opens a memory pool from an already-open file descriptor. It takes a pagesize and a maxcache parameter. pagesize is typically a persistent parameter of an upper layer (e.g., stored in the btree metadata page), while maxcache can be tuned for performance without changing persistent database parameters.

mpool_get gets an existing page. It checks to see if the requested page is not pinned, and retrieves it from the cache if it's already in the cache. (If built with -DDEBUG, it will abort if the page is already pinned, unless the caller passes the MPOOL_IGNOREPIN flag). If the page is not in cache, mpool_get reads in the page from disk. Regardless of how it obtained the page, mpool_get sets MPOOL_PINNED unless the caller passes the MPOOL_IGNOREPIN flag. If the current cache size is at least maxcache, mpool_get tries to reuse unreferenced cache entries when reading in a previously uncached page (through static helper mpool_bkt). It does this by getting the first unpinned page from the LRU queue (which might be dirty, in which case it will flush it to disk). If this fails to free up a cache entry, mpool_bkt will allocate a new entry anyway, growing the cache beyond the maxcache.

mpool_put releases a pinned page, clearing MPOOL_PINNED. The caller can pass the MPOOL_DIRTY flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, mpool_put does not change the order of the LRU queue; only mpool_get does that, which means that the ordering of mpool_put calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent mpool_get fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.

mpool_sync writes dirty pages out to disk and calls fsync. The only part of the btree back end that calls it is __bt_sync (usually by way of __bt_close).

btree

The btree back end is actually a B+tree. Pages in the tree are either internal pages or leaf pages. There is also a free list (with its head pointer in the metadata page). Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) paired with pointers to child pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch a potentially multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail (due to crashes or power failures). Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.

Compatibility

Our bundled "db2" is actually probably db-1.86 with patches. db-1.85 was the version included by most BSD derivatives. db-1.86 made backward-incompatible changes in the hash back end (but not probably not the btree back end, though we haven't confirmed that).

There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.