Projects/Replay cache improvements


Revision as of 21:31, 20 February 2019

This is an early stage project for MIT Kerberos. It is being fleshed out by its proponents. Feel free to help flesh out the details of this project. After the project is ready, it will be presented for review and approval.


Background

MIT krb5 implements a replay cache and uses it by default. Replay caches provide a limited defense against certain kinds of active attacks against some protocols. The current replay cache implementation has severe performance limitations as well as flaws which can cause both false positives and false negatives. Many server deployments disable the replay cache as a result.

Synopsis of current implementation

The replay cache stores replay records which contain server and client principal names, an authenticator ciphertext hash, and an authenticator timestamp. The ciphertext hash is not part of the file format, so for backward compatibility, two records are stored for each authenticator, one of which encodes the hash, client, and server principal names in the nominal server name field. Records are generally small, but are not fixed-size because principal names vary in length.

When a replay cache handle is opened, we attempt to open an existing replay cache file and read in all of its current entries into an in-memory hash table. If we fail to open the file or read any of its entries, we attempt to initialize the file. If we successfully read all of the entries but detect more than 30 expired entries, we attempt to expunge the file. To initialize the file, we unlink it (ignoring errors), open it with O_CREAT|O_EXCL, and write a header consisting of a two-byte version number and a four-byte lifespan. To expunge the file, we close the file descriptor, initialize the file, then write all non-expired entries from the hash table.

For each authentication, the replay record is checked against the memory hash table for a conflict, then added to the table, then marshalled and written to the file. Based on a heuristic, we may choose to expunge the file; otherwise we fsync() it to ensure that no replay records are lost due to a system reboot.

Correctness

The current implementation has the following possible issues with correctness:

  • There is a race condition when creating a replay cache file, or replacing one after detecting corruption ([krbdev.mit.edu #3498]).
  • On Windows, races while expunging the replay cache can cause temporary files to be left behind, which eventually exhausts the space in the directory.
  • There is a possibility for file corruption due to lack of locking and not using O_APPEND ([krbdev.mit.edu #1671]). If two processes nearly simultaneously write at the same file offset, the writes will overlap unless O_APPEND is set.
  • Once a replay cache handle is open, entries written by other processes (or through other handles) will not be detected. Expunges performed through other handles will also not be detected, and the current handle will continue to write entries to the unlinked file.

Performance

The current implementation performs an fsync() on every write ([krbdev.mit.edu #372]). Unless the replay cache is located in a memory filesystem, this is always the limiting performance factor.

Since the current format is not in any way sorted, a newly opened replay cache handle must read all entries in the file to detect a conflict. For high-traffic situations, this is performance-limiting if fsync is not. If a single replay cache handle is used for many authentications, this problem does not apply because the replay cache implementation (incorrectly) does not read new entries.

The current replay cache implementation does not use any file locking. Although this causes correctness issues, it means that contention between processes is not performance-limiting.

Improvements

Short-term solutions may include:

  • DONE: [krbdev.mit.edu #3498] Fix the file creation race somehow to avoid spurious failures, perhaps by unlinking and retrying on failure to open.
  • Start using O_APPEND and detect when other processes have written entries.
  • Detect expunges by other processes.
  • Start using file locking.
  • Stop using fsync, at least for most records.

Medium-term solutions may include:

  • New file-based replay cache, possibly hash table based.
  • IPC-based replay cache for higher performance.

Long-term solutions may include:

  • Revise protocols to not require replay caches for security

fsync avoidance

The simplest way to avoid using fsync() is to decide that it's okay to potentially lose replay records due to a system reboot. Since replay caches can never be perfect, this might be an acceptable limitation of the implementation.

Nico has suggested only fsyncing records whose authenticator timestamps might postdate the next reboot, and rejecting authenticators whose timestamps predate the current boot. Potential issues with this idea include:

  • If a non-trivial percentage of incoming authenticators are timestamped in the future by more than the reboot time estimate, performance might still be limited by fsync.
  • If the server clock drifts into the past by more than the reboot time estimate, the replay cache becomes slow.
  • Rejecting authenticators timestamped before the current boot could result in many spurious failures just after reboot.

Hash-based file format

See http://mailman.mit.edu/pipermail/krbdev/2014-September/012166.html for initial notes.

Replay records can be made small and fixed-size. All we need is a tag to determine a replay match, and a timestamp to determine if a record is expired. We can optionally discard some of the low-order and high-order timestamp bits, as it is not harmful to occasionally treat an expired record as current. The tag bytes can be taken from the integrity tag of the ciphertext (for authenticators or mk_priv messages), or from the MIC of a mk_safe message. If the tag is N bits long, we expect false positives to occur when around 2^(N/2) replay records are live during the replay window. For practical purposes, a 96-bit or longer tag is unlikely to create any issues with false positives; a 56-bit tag (in order to produce a 64-bit record when combined with 8 timestamp bits) is more aggressive, but still probably okay.
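For concreteness, the 64-bit record variant (56 tag bits plus one byte of coarse time) can be sketched as follows; the clock-skew constant is an assumed value, not one from the source:

```python
CLOCKSKEW = 300  # assumed clock-skew window in seconds

def make_record(tag: bytes, timestamp: int) -> bytes:
    """Pack a 64-bit record: the first 56 bits of the integrity tag,
    then the timestamp divided by the clock skew, modulo 256."""
    return tag[:7] + bytes([(timestamp // CLOCKSKEW) % 256])

def matches(record: bytes, tag: bytes) -> bool:
    """A lookup hit compares only the 56 tag bits, not the time byte."""
    return record[:7] == tag[:7]
```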

With fixed-sized records, we can store a hash table in a file by hashing the tag down to an index modulo some value, and seeking to the record size times the index. If the record we find there matches the tag, we report a replay. If the record there is empty or expired, we can store the new record there.
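A sketch of the offset computation, using SHA-256 as a stand-in for whatever hash derives the index:

```python
import hashlib

RECORD_SIZE = 8  # bytes per record, per the 64-bit record variant

def slot_offset(tag: bytes, modulus: int, table_start: int = 0) -> int:
    """Hash the tag down to an index modulo the table size, and convert
    the index to a byte offset within the file."""
    index = int.from_bytes(hashlib.sha256(tag).digest()[:4], "big") % modulus
    return table_start + index * RECORD_SIZE
```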

If the modulus is M, we expect a collision when approximately sqrt(M) records are simultaneously current. We could handle a small number of collisions using some kind of open-addressing scheme, but the fundamental strategy is to create a new hash table at the end of the current one. The file size can be used to detect the number of hash tables in the file. Once multiple tables are present in the file, we must check all of the tables when looking up records. If we don't find a match, we can store the new record in the first empty or expired slot. If all of the slots are filled with unexpired records, we can create a new hash table.
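Because each new table doubles the size of the previous one, the set of tables can be recovered from the file size alone. A sketch (the initial size of 1024 slots matches the experiments below):

```python
RECORD_SIZE = 8
FIRST_TABLE_SLOTS = 1024

def table_layout(file_size: int):
    """Return (start_offset, slot_count) for each hash table implied by
    the file size, each table twice the size of the one before it."""
    tables = []
    offset, slots = 0, FIRST_TABLE_SLOTS
    while offset < file_size:
        tables.append((offset, slots))
        offset += slots * RECORD_SIZE
        slots *= 2
    return tables
```

A lookup walks every table in this list; a store uses the first empty or expired slot found, appending a new table if none exists.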

An authenticated attacker could potentially provoke collisions much faster than we would expect with an even distribution of integrity tags, using offline computation to find valid integrity tags (by varying the confounder) which all hash to the same value for each modulus. A seeded hash function could mitigate this attack, at the expense of having to allocate some space in the file to store the seed.

Some care must be taken to avoid races between looking up a record and storing it, or else two processes could store the same record without detecting that one of them is a replay. The simplest strategy is to exclusively lock the whole file before lookup and release the lock after storing the new entry. Byte-range locks can be used to improve concurrency, at a cost in complexity and system call overhead. Migrating the implementation from whole-file locking to per-record locking or vice-versa is safe, as a whole-file lock will contend with all per-record locks. If records are 64 bits, it might be possible to use mmap() and compare-and-swap instructions to avoid locks entirely, at a cost in both portability and complexity.

To produce some experimental test results, a harness was created which spawns 1000 processes to each store 1000 unique tags at the same timestamp. As each process completes, the master process stores the same 1000 tags to verify that they are all detected as replays. The basic implementation used 128-bit records, hash table sizes beginning at 1024 slots and doubling with each new hash table, no open addressing, and whole-file locking.

  • With the basic implementation, the nominal file size after the test was 126MB, or 132 bytes (8.2 table slots) per record. On a filesystem supporting holes, the space used was 59MB, or 61 bytes (3.8 table slots) per record. The test took about 32s on a Linux VM running on unknown hardware, or about 15us per store operation (with half the store operations writing records and the other half finding replays).
  • With 2-slot linear open addressing added (each table size is reduced by one; a record hashes to its index and the next index in each table), the nominal file size was reduced to 64MB (67 bytes per record, 4.2 table slots) and the space used to 35MB (36 bytes per record, 2.2 table slots). The runtime was reduced to about 30s. For larger amounts of open addressing, the amount of space used declined as far as 32MB, and the nominal size to 32MB as well at 5-slot open addressing (by fitting all of the records in eleven tables instead of twelve). The runtime remained at 29-30s for any reasonably small amount of open addressing.
  • With the whole-file locks replaced with per-record (byte range) locking, the runtime was increased to about 38s. The VM reported only having one CPU, so the tests were re-run on a bare-metal machine with four cores; the record-locking variant still had a proportionally similar runtime increase versus whole-file locking. It appears that the concurrency advantages of per-record locking are outweighed by the increased system call overhead. (2-slot open addressing was also present for this test.)
  • With the record size reduced to 8 bytes (56 bits for the tag, one byte for the timestamp divided by the clockskew, modulo 256) and the initial table size increased to 2048, the nominal file size was reduced to 32MB (33 bytes per record, 4.2 table slots) and the space used to 19MB (19 bytes per record, 2.4 table slots). The runtime was reduced to about 29s. (2-slot open addressing was also present for this test, but per-record locking was not.)

Transition to a new format

We believe, but have not rigorously confirmed, that no other Kerberos implementation implements the MIT krb5 replay cache format. A transition to a new replay cache format must still take into account that old replay cache files may exist after an upgrade, and that old and new versions of MIT krb5 might be used on the same system (e.g. the native OS version and a local build of a more recent version).

The simplest transition strategy is to change the filename picked by krb5_get_server_rcache, and decide that old and new versions do not share replay records.

Other discussions

   http://colabti.org/irclogger/irclogger_log/krbdev?date=2014-08-26
   https://lists.anl.gov/pipermail/ietf-krb-wg/2011-November/003241.html
   http://mailman.mit.edu/pipermail/krbdev/2014-September/012163.html
   http://mailman.mit.edu/pipermail/krbdev/2014-September/012164.html
   http://mailman.mit.edu/pipermail/krbdev/2014-September/012165.html
   http://mailman.mit.edu/pipermail/krbdev/2014-September/012166.html