https://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&feed=atom&action=history
Berkeley DB notes - Revision history
2024-03-28T15:27:30Z
Revision history for this page on the wiki
MediaWiki 1.27.4
https://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&diff=5658&oldid=prev
TomYu: /* Bugs */
2016-08-26T14:40:08Z
<p><span dir="auto"><span class="autocomment">Bugs</span></span></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 14:40, 26 August 2016</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 34:</td>
<td colspan="2" class="diff-lineno">Line 34:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>Byteswapping in <code>bt_conv.c</code> might not work correctly for leaf records with small keys but big (overflow) data. This probably can't be caught by the current test suite, which can only check that the byteswapping code is internally consistent. (A proper test for this would need a database file created on an opposite-endian platform.)</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>Byteswapping in <code>bt_conv.c</code> might not work correctly for leaf records with small keys but big (overflow) data. This probably can't be caught by the current test suite, which can only check that the byteswapping code is internally consistent. (A proper test for this would need a database file created on an opposite-endian platform.)</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Overflow page pointers for big keys (and sometimes big data) are unaligned; this can cause problems with byte swapping on platforms that enforce strict alignment.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==BSD "upstreams"==</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==BSD "upstreams"==</div></td>
</tr>
</table>
TomYu
https://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&diff=5657&oldid=prev
TomYu at 12:18, 26 August 2016
2016-08-26T12:18:47Z
<p></p>
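<p>The unaligned overflow page pointers noted in the 14:40, 26 August 2016 revision can be byte swapped without assuming alignment by copying the bytes into an aligned temporary first. The following is only a minimal sketch of that idea, not the code in <code>bt_conv.c</code>; the helper names <code>load_u32</code>, <code>store_u32</code>, <code>swap32</code>, and <code>swap_u32_at</code> are hypothetical.</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
 * address.  memcpy places no alignment requirement on its arguments. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical helper: write a 32-bit value to a possibly unaligned
 * address. */
static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffU) << 24) | ((v & 0x0000ff00U) << 8) |
           ((v & 0x00ff0000U) >> 8) | ((v & 0xff000000U) >> 24);
}

/* Byte swap, in place, the 32-bit field starting at p, whatever its
 * alignment.  A direct cast such as *(uint32_t *)p could fault on a
 * platform that enforces strict alignment. */
static void swap_u32_at(unsigned char *p)
{
    store_u32(p, swap32(load_u32(p)));
}
```

<p>Applied to the bytes <code>11 22 33 44</code> stored at an odd offset in a buffer, <code>swap_u32_at</code> leaves <code>44 33 22 11</code> in the same place on a host of either byte order.</p>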
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 12:18, 26 August 2016</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 30:</td>
<td colspan="2" class="diff-lineno">Line 30:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>==Bugs==</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Byteswapping in <code>bt_conv.c</code> might not work correctly for leaf records with small keys but big (overflow) data. This probably can't be caught by the current test suite, which can only check that the byteswapping code is internally consistent. (A proper test for this would need a database file created on an opposite-endian platform.)</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==BSD "upstreams"==</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==BSD "upstreams"==</div></td>
</tr>
</table>
TomYu
https://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&diff=5656&oldid=prev
TomYu at 19:56, 25 August 2016
2016-08-25T19:56:26Z
<p></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 19:56, 25 August 2016</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 30:</td>
<td colspan="2" class="diff-lineno">Line 30:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>==BSD "upstreams"==</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The major open-source BSD derivatives NetBSD, FreeBSD, and OpenBSD have db-1.85/1.86 in their libc. They have shared various patches back and forth, but there are still some divergences among them. These can become relevant when trying to minimize diffs.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>NetBSD seems to have converted from BSD fixed-width types (e.g., <code>u_int32_t</code>) to C99/POSIX fixed-width types (e.g., <code>uint32_t</code>), but FreeBSD and OpenBSD have not. FreeBSD and NetBSD seem to have converted function declarations to prototypes, but OpenBSD has not. NetBSD and OpenBSD have eliminated BSD type names such as <code>u_long</code>, but FreeBSD has not.</div></td>
</tr>
</table>
TomYu
https://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&diff=5654&oldid=prev
TomYu at 16:14, 15 August 2016
2016-08-15T16:14:43Z
<p></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 16:14, 15 August 2016</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 19:</td>
<td colspan="2" class="diff-lineno">Line 19:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>===btree===</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>===btree===</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list (with its head pointer in the metadata page). Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) paired with pointers to child pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can potentially touch multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail (due to crashes or power failures). Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.</div></td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list (with its head pointer in the metadata page). Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) paired with pointers to child pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can potentially touch multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail (due to crashes or power failures). Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps<ins class="diffchange diffchange-inline">. Sibling link corruption can lead to worse corruption if a page with a corrupted sibling link splits or is deleted</ins>.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Page splits of non-root pages always preserve the page being split as the left-hand page. Splitting the root creates two new pages and converts the root to an internal page if it started out as a leaf page.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Even though every page in a tree should have at least two children or items, some pages can end up with only one due to deletions. Only when a deletion would completely empty a page is that page deleted.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==Compatibility==</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==Compatibility==</div></td>
</tr>
</table>
TomYu
https://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&diff=5653&oldid=prev
TomYu at 15:51, 13 August 2016
2016-08-13T15:51:07Z
<p></p>
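<p>The sibling-link corruption described in the 16:14, 15 August 2016 revision (inconsistent links, and occasionally loops in a sibling chain) can be illustrated with a small checker. This is a sketch over a simplified in-memory model, not the on-disk page format: <code>struct page_links</code>, <code>PG_NONE</code>, and <code>check_sibling_chain</code> are hypothetical names, and the model assumes that page number 0 (the metadata page) never appears as a sibling.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "no sibling" marker.  Page 0 is the metadata page in
 * this model, so it cannot be a real sibling. */
#define PG_NONE 0u

/* Simplified model of one page: only its left and right sibling links. */
struct page_links {
    uint32_t prevpg;    /* page number of the left sibling, or PG_NONE */
    uint32_t nextpg;    /* page number of the right sibling, or PG_NONE */
};

/*
 * Walk the sibling chain rightward from its leftmost page, start.
 * Return the number of pages visited, or -1 if a link points outside
 * the file, if a page's left link does not name the page we arrived
 * from, or if the walk visits more pages than exist (a loop).
 */
static long check_sibling_chain(const struct page_links *pages,
                                size_t npages, uint32_t start)
{
    long count = 0;
    uint32_t cur = start, prev = PG_NONE;

    while (cur != PG_NONE) {
        if (cur >= npages)
            return -1;          /* link points past the end of the file */
        if (pages[cur].prevpg != prev)
            return -1;          /* left and right links disagree */
        if ((size_t)++count > npages)
            return -1;          /* bound the walk so it always terminates */
        prev = cur;
        cur = pages[cur].nextpg;
    }
    return count;
}
```

<p>A dump that walks the chain with a bound like this stops with an error instead of looping until disk space runs out. Detecting the inconsistency does not repair it; it only avoids making matters worse by continuing.</p>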
<table class="diff diff-contentalign-left" data-mw="interface">
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;' lang='en'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 15:51, 13 August 2016</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==Internals==</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==Internals==</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.</div></td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>These notes mostly describe the btree back end for our special db-1.85/1.86 variant<ins class="diffchange diffchange-inline">, as used for the "db2" KDB module</ins>. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-lineno">Line 9:</td>
<td colspan="2" class="diff-lineno">Line 9:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>Berkeley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (<code>MPOOL_PINNED</code>) and/or dirty (<code>MPOOL_DIRTY</code>). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>Berkeley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (<code>MPOOL_PINNED</code>) and/or dirty (<code>MPOOL_DIRTY</code>). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"><a class="mw-diff-movedpara-left" href="#movedpara_5_1_rhs">⚫</a></td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><a name="movedpara_3_0_lhs"></a><code>mpool_get</code> gets an existing page. It checks to see if the requested page is <del class="diffchange diffchange-inline">already</del> <del class="diffchange diffchange-inline">in</del> <del class="diffchange diffchange-inline">memory</del> <del class="diffchange diffchange-inline">in</del> <del class="diffchange diffchange-inline">a</del> cache <del class="diffchange diffchange-inline">and</del> <del class="diffchange diffchange-inline">not</del> <del class="diffchange diffchange-inline">pinned</del>. (<del class="diffchange diffchange-inline">It</del> will abort if the page is already pinned, unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag). <del class="diffchange diffchange-inline">It</del> reads in the page from disk <del class="diffchange diffchange-inline">if</del> <del class="diffchange diffchange-inline">necessary</del>, <del class="diffchange diffchange-inline">and</del> sets <code>MPOOL_PINNED</code> unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag. If the cache size is <del class="diffchange diffchange-inline">larger</del> <del class="diffchange diffchange-inline">than</del> <del class="diffchange diffchange-inline">a configured maximum</del>, <code>mpool_get</code> tries to reuse unreferenced cache entries when reading in a previously uncached page. It does this by <del class="diffchange diffchange-inline">walking</del> the <del class="diffchange diffchange-inline">LRU</del> <del class="diffchange diffchange-inline">queue</del> <del class="diffchange diffchange-inline">for</del> <del class="diffchange diffchange-inline">unpinned</del> <del class="diffchange diffchange-inline">pages</del> (which might be dirty, in which case it will flush <del class="diffchange diffchange-inline">them</del> to disk). 
If this fails to free up a cache entry, <code><del class="diffchange diffchange-inline">mpool_get</del></code> will allocate a new entry anyway, growing the cache beyond the <del class="diffchange diffchange-inline">nominal maximum size</del>.</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><code>mpool_open</code> opens a memory pool from an already-open file descriptor. It takes a <code>pagesize</code> and a <code>maxcache</code> parameter. <code>pagesize</code> is typically a persistent parameter of an upper layer (e.g., stored in the btree metadata page), while <code>maxcache</code> can be tuned for performance without changing persistent database parameters.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker"><a class="mw-diff-movedpara-right" href="#movedpara_3_0_lhs">⚫</a></td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><a name="movedpara_5_1_rhs"></a><code>mpool_get</code> gets an existing page. It checks to see if the requested page is <ins class="diffchange diffchange-inline">not</ins> <ins class="diffchange diffchange-inline">pinned,</ins> <ins class="diffchange diffchange-inline">and</ins> <ins class="diffchange diffchange-inline">retrieves</ins> <ins class="diffchange diffchange-inline">it from the</ins> cache <ins class="diffchange diffchange-inline">if</ins> <ins class="diffchange diffchange-inline">it's</ins> <ins class="diffchange diffchange-inline">already in the cache</ins>. (<ins class="diffchange diffchange-inline">If built with <code>-DDEBUG</code>, it</ins> will abort if the page is already pinned, unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag). <ins class="diffchange diffchange-inline">If the page is not in cache, <code>mpool_get</code></ins> reads in the page from disk<ins class="diffchange diffchange-inline">.</ins> <ins class="diffchange diffchange-inline">Regardless</ins> <ins class="diffchange diffchange-inline">of how it obtained the page</ins>, <ins class="diffchange diffchange-inline"><code>mpool_get</code></ins> sets <code>MPOOL_PINNED</code> unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag. If the<ins class="diffchange diffchange-inline"> current</ins> cache size is <ins class="diffchange diffchange-inline">at</ins> <ins class="diffchange diffchange-inline">least</ins> <ins class="diffchange diffchange-inline"><code>maxcache</code></ins>, <code>mpool_get</code> tries to reuse unreferenced cache entries when reading in a previously uncached page<ins class="diffchange diffchange-inline"> (through static helper <code>mpool_bkt</code>)</ins>. 
It does this by <ins class="diffchange diffchange-inline">getting</ins> the <ins class="diffchange diffchange-inline">first</ins> <ins class="diffchange diffchange-inline">unpinned</ins> <ins class="diffchange diffchange-inline">page</ins> <ins class="diffchange diffchange-inline">from</ins> <ins class="diffchange diffchange-inline">the LRU queue</ins> (which might be dirty, in which case it will flush <ins class="diffchange diffchange-inline">it</ins> to disk). If this fails to free up a cache entry, <code><ins class="diffchange diffchange-inline">mpool_bkt</ins></code> will allocate a new entry anyway, growing the cache beyond the <ins class="diffchange diffchange-inline"><code>maxcache</code></ins>.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div><code>mpool_put</code> releases a pinned page, clearing <code>MPOOL_PINNED</code>. The caller can pass the <code>MPOOL_DIRTY</code> flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, <code>mpool_put</code> does not change the order of the LRU queue; only <code>mpool_get</code> does that, which means that the ordering of <code>mpool_put</code> calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent <code>mpool_get</code> fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div><code>mpool_put</code> releases a pinned page, clearing <code>MPOOL_PINNED</code>. The caller can pass the <code>MPOOL_DIRTY</code> flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, <code>mpool_put</code> does not change the order of the LRU queue; only <code>mpool_get</code> does that, which means that the ordering of <code>mpool_put</code> calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent <code>mpool_get</code> fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><code>mpool_sync</code> writes dirty pages out to disk and calls <code>fsync</code>. The only part of the btree back end that calls it is <code>__bt_sync</code> (usually by way of <code>__bt_close</code>).</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>===btree===</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>===btree===</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list. Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) with pointers to <del class="diffchange diffchange-inline">lower level</del> pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail. Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.</div></td>
<td class="diff-marker">+</td>
<td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list<ins class="diffchange diffchange-inline"> (with its head pointer in the metadata page)</ins>. Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted)<ins class="diffchange diffchange-inline"> paired</ins> with pointers to <ins class="diffchange diffchange-inline">child</ins> pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail<ins class="diffchange diffchange-inline"> (due to crashes or power failures)</ins>. Often, corruption will result in some items being inaccessible to sequential access, random access, or both<ins class="diffchange diffchange-inline">. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links</ins>. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==Compatibility==</div></td>
<td class="diff-marker"> </td>
<td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>==Compatibility==</div></td>
</tr>
</table>TomYuhttps://k5wiki.kerberos.org/wiki?title=Berkeley_DB_notes&diff=5652&oldid=prevTomYu: Created page with "==Internals== These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help ope..."2016-08-12T22:37:20Z<p>Created page with "==Internals== These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help ope..."</p>
<p><b>New page</b></p><div>==Internals==<br />
<br />
These notes mostly describe the btree back end for our special db-1.85/1.86 variant. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.<br />
<br />
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.<br />
<br />
===mpool===<br />
<br />
Berkeley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (<code>MPOOL_PINNED</code>) and/or dirty (<code>MPOOL_DIRTY</code>). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.<br />
<br />
<code>mpool_get</code> gets an existing page. It checks to see if the requested page is already in the in-memory cache and not pinned. (It will abort if the page is already pinned, unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag.) It reads in the page from disk if necessary, and sets <code>MPOOL_PINNED</code> unless the caller passes the <code>MPOOL_IGNOREPIN</code> flag. If the cache size is larger than a configured maximum, <code>mpool_get</code> tries to reuse unreferenced cache entries when reading in a previously uncached page. It does this by walking the LRU queue for unpinned pages (which might be dirty, in which case it will flush them to disk). If this fails to free up a cache entry, <code>mpool_get</code> will allocate a new entry anyway, growing the cache beyond the nominal maximum size.<br />
<br />
<code>mpool_put</code> releases a pinned page, clearing <code>MPOOL_PINNED</code>. The caller can pass the <code>MPOOL_DIRTY</code> flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, <code>mpool_put</code> does not change the order of the LRU queue; only <code>mpool_get</code> does that, which means that the ordering of <code>mpool_put</code> calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent <code>mpool_get</code> fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.<br />
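<br />
The LRU behavior described above can be illustrated with a toy model. This is only a sketch, not the mpool source: <code>toy_pool</code>, <code>toy_get</code>, <code>toy_put</code>, <code>toy_victim</code>, and the <code>TOY_*</code> flags are hypothetical names invented for illustration, and the array-based queue is an assumption made to keep the example short.<br />
<br />
```c
#include <assert.h>
#include <stddef.h>

#define TOY_PINNED 0x1  /* hypothetical stand-in for MPOOL_PINNED */
#define TOY_DIRTY  0x2  /* hypothetical stand-in for MPOOL_DIRTY */
#define TOY_NPAGES 4    /* toy limit on distinct page numbers */

/* Toy cache: lru[0] is the head (first eviction candidate) and
 * lru[n - 1] is the tail (most recently fetched page). */
struct toy_pool {
    int lru[TOY_NPAGES];    /* cached page numbers in LRU order */
    int flags[TOY_NPAGES];  /* flags, indexed by page number */
    size_t n;               /* number of cached pages */
};

/* Fetch a page: move (or append) it to the tail and pin it. */
static void toy_get(struct toy_pool *p, int pgno)
{
    size_t i, j;

    assert(pgno >= 0 && pgno < TOY_NPAGES);
    for (i = 0; i < p->n && p->lru[i] != pgno; i++)
        ;
    if (i == p->n)          /* not cached yet: make room at the tail */
        p->n++;
    for (j = i; j + 1 < p->n; j++)
        p->lru[j] = p->lru[j + 1];
    p->lru[p->n - 1] = pgno;
    p->flags[pgno] |= TOY_PINNED;
}

/* Release a page: clear the pin and record dirtiness, but leave
 * the queue order alone, as the notes describe for mpool_put. */
static void toy_put(struct toy_pool *p, int pgno, int dirty)
{
    assert(pgno >= 0 && pgno < TOY_NPAGES);
    p->flags[pgno] &= ~TOY_PINNED;
    p->flags[pgno] |= dirty;
}

/* Return the first unpinned page from the head of the queue, or -1
 * if every cached page is pinned. If that page is dirty, it is the
 * one that would be flushed next when the cache needs the entry. */
static int toy_victim(const struct toy_pool *p)
{
    size_t i;

    for (i = 0; i < p->n; i++) {
        if (!(p->flags[p->lru[i]] & TOY_PINNED))
            return p->lru[i];
    }
    return -1;
}
```
<br />
In this model, fetching and releasing page 1 (dirty) and then page 2 leaves page 1 at the head, so it is the next reuse candidate. Fetching page 1 again moves it behind page 2 while it is still dirty, which is the delayed write-back effect described above.<br />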
<br />
===btree===<br />
<br />
The btree back end is actually a [https://en.wikipedia.org/wiki/B%2B_tree B+tree]. Pages in the tree are either internal pages or leaf pages. There is also a free list. Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) with pointers to lower level pages. Pages have sibling links to pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal. Updates that split a page can touch multiple pages, especially if parent pages must split (possibly up to the root), leading to opportunities for corruption if some of the writes fail. Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps.<br />
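<br />
A dump tool could guard against the sibling-chain loops mentioned above with a standard cycle check. The following is a minimal sketch using Floyd's tortoise-and-hare method over an in-memory array of right-sibling page numbers. The function <code>chain_has_loop</code>, the array representation, and the <code>TOY_P_INVALID</code> end-of-chain marker are assumptions for illustration; this is not code from the btree back end.<br />
<br />
```c
#include <assert.h>
#include <stddef.h>

#define TOY_P_INVALID 0  /* assumed end-of-chain marker (page 0 is metadata) */

/* next[i] holds the right-sibling page number of page i, or
 * TOY_P_INVALID if page i ends the chain. Returns 1 if following
 * the links from 'start' revisits a page, 0 if the chain ends.
 * A link that points past npages is treated as the end of the
 * chain rather than as a loop. */
static int chain_has_loop(const unsigned *next, size_t npages, unsigned start)
{
    unsigned slow = start, fast = start;

    assert(next != NULL);
    while (fast != TOY_P_INVALID && fast < npages) {
        fast = next[fast];              /* first step of the fast walker */
        if (fast == TOY_P_INVALID || fast >= npages)
            return 0;
        fast = next[fast];              /* second step of the fast walker */
        slow = next[slow];              /* single step of the slow walker */
        if (fast == slow && fast != TOY_P_INVALID)
            return 1;                   /* walkers met: the chain loops */
    }
    return 0;
}
```
<br />
The check uses constant memory, so it would not need a visited-page table even for a large database. It only reports that a loop exists; locating the bad link would need a separate pass.<br />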
<br />
==Compatibility==<br />
<br />
Our bundled "db2" is actually probably db-1.86 with patches. db-1.85 was the version included by most BSD derivatives. db-1.86 made backward-incompatible changes in the hash back end (but probably not the btree back end, though we haven't confirmed that).<br />
<br />
There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.</div>TomYu