Projects/Graceful recovery after destructive service rekey
For some period of time after a service is re-keyed, clients may have cached credentials using the old version of the service key. Ideally, the server will retain the old key versions for as long as there might be outstanding tickets. If a server is completely re-provisioned or replaced, it may be impractical to retain the old keytab. In this case, the usual practice is for users to manually refresh their ticket caches (by running kinit) to discard service tickets, which is inconvenient. It would be better if the client library code could automatically discard the cached service ticket and get a new one.
There are several obstacles to solving this problem as gracefully as we might like:
- The client only discovers that its service ticket is out of date when it receives a KRB_AP_ERR_BADKEYVER error from the receiver. At this point, to retry within the library, we would need to extend RFC 4121, with consequent issues of endpoint negotiation and application compatibility. Without extending the krb5 mech, the best we can do is make the next attempt at authentication work.
- Since 1.7, the server returns KRB_AP_WRONG_PRINC instead of KRB_AP_ERR_BADKEYVER on a key version mismatch; see [krbdev.mit.edu #7232]. This can be fixed for simple cases, but if the ticket uses an alias for the server, it is not always possible to tell whether the client presented a ticket with the wrong kvno or just a ticket for the wrong server principal. (Update: the simple case will be addressed in 1.13 by [krbdev.mit.edu #7232].)
- There is no authentication of errors in an AP exchange, so we have to consider the possibility of an attacker forcing the client to discard its service ticket, although there are probably no interesting attacks.
- Most current ccache types do not support a way to remove a credential, and it is difficult to do so within a file ccache (although Heimdal has an approach where the credential is modified in a way that makes it unlikely to be matched). If we try to work around this by reinitializing the ccache and copying all of the other credentials, we run into [krbdev.mit.edu #7707].
- If there is propagation delay between master and replica KDCs, re-fetching the service ticket may just get us another out-of-date ticket, unless we somehow make sure to use the master KDC (which is not supported by current APIs). This problem could be considered out of scope.
- We could be in a situation where the KDC consistently gives us a service ticket which produces a KRB_AP_ERR_BADKEYVER result, no matter how many times we refresh it. This could happen because of the propagation delay described above or because of a misconfiguration. If we discard the service ticket over and over again, we could generate a lot of TGS requests we wouldn't otherwise generate. From the KDC's perspective this is not necessarily any worse than the performance issues which result from not having negative caching of TGS requests, but on the client it could also cause a file ccache to grow each time we try to authenticate.
- The above scenario could be mitigated by annotating a service ticket when it works, and then only discarding the ticket if it worked at least once and then started to fail. The benefit is that we would avoid retry loops; the cost is that we wouldn't get graceful recovery in the unlikely event that a service is re-keyed between the client getting a service ticket and its first use. The ccache API currently provides no way to annotate a ticket, so we would have to add that capability or abuse ccache config entries to fake it.
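
The annotate-then-discard policy from the last item can be sketched as a toy model. This is only an illustration under stated assumptions: ToyCache, CachedTicket, ApResult and the worked_once flag are hypothetical names and are not part of the krb5 ccache API. The in-memory dictionary stands in for a ccache type that supports both annotation and removal, which most current ccache types do not.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ApResult(Enum):
    """Simplified outcome of one AP exchange, as seen by the client."""
    OK = auto()
    BADKEYVER = auto()  # stands in for KRB_AP_ERR_BADKEYVER


@dataclass
class CachedTicket:
    service: str
    kvno: int
    # Hypothetical annotation: the ccache API has no such field today.
    worked_once: bool = False


@dataclass
class ToyCache:
    """In-memory stand-in for a ccache with annotation and removal."""
    tickets: dict = field(default_factory=dict)

    def record_result(self, service: str, result: ApResult) -> bool:
        """Apply the policy; return True if the ticket was discarded."""
        ticket = self.tickets.get(service)
        if ticket is None:
            return False
        if result is ApResult.OK:
            # Annotate the ticket the first time it is accepted.
            ticket.worked_once = True
            return False
        if result is ApResult.BADKEYVER and ticket.worked_once:
            # It worked before and now fails: assume a rekey and discard,
            # so the next authentication attempt fetches a fresh ticket.
            del self.tickets[service]
            return True
        # Never worked: keep it, to avoid a discard/refetch loop.
        return False


cache = ToyCache()
cache.tickets["host/a.example.com"] = CachedTicket("host/a.example.com", kvno=3)
cache.record_result("host/a.example.com", ApResult.OK)
discarded = cache.record_result("host/a.example.com", ApResult.BADKEYVER)
```

In this model a ticket that has never been accepted is kept even after a key version error, so a KDC that keeps issuing stale tickets cannot trigger repeated TGS requests. The cost is that a rekey between ticket acquisition and first use is not recovered from.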