Notes on converting from Subversion to Git
Most likely, bulk conversion of the Subversion repository will be done with git svn clone, as was done for our krb5-anonsvn repository on github.
Because of the way git uses SHA-1 hashes, it is desirable to make the conversion as clean as possible now, as doing any cleanup later will be disruptive to downstream repositories. Because of our project's long history and historical lack of consistent code documentation, our version control history is a valuable investigative tool, and effort applied to cleaning up our history during conversion will pay dividends in the future.
Log message cleanup
Subversion commit authors are simply repository-local usernames. git commit authors contain a real name and an email address. git svn clone supports a map file mapping Subversion authors to git authors. We should prepare a map file containing the names of all historical Subversion authors, using @mit.edu addresses for all usernames.
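A minimal sketch of generating such a map file follows. It assumes we have collected a table of Subversion usernames and real names by hand; the function name and the sample entry in the usage note are hypothetical. The output uses the "username = Real Name <address>" line format that git-svn reads from its authors file.

```python
def format_authors_map(real_names, domain="mit.edu"):
    """Return the text of a git-svn authors file.

    real_names maps each Subversion username to that person's real name.
    Every address is formed as username@domain, per the plan above.
    """
    lines = []
    for username in sorted(real_names):
        name = real_names[username]
        lines.append(f"{username} = {name} <{username}@{domain}>")
    return "\n".join(lines) + "\n"
```

For example, format_authors_map({"jdoe": "Jane Doe"}) yields the single line "jdoe = Jane Doe <jdoe@mit.edu>". The list of usernames that need entries could be taken from the author fields of svn log.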
In git repositories, it is common for the author and date of a commit to reflect the original authorship even if the author doesn't have write access to the repository, and a Signed-off-by: line to indicate who pushed the change. This practice was not possible in Subversion, so our Subversion log metadata reflects the actual committer. It would probably be too much effort to remedy this mismatch at conversion time.
It is recommended practice in git to begin each commit message with a summary line of fewer than 50 characters, followed by a blank line. Some tools, such as git log --oneline, assume this practice. It is probably worth some amount of effort to make most of our converted log messages follow this practice.
Many of our commit messages begin with RT metadata headers. These should be moved to the end of the message so they do not appear in summary lines. If a message contains a Subject: header for the new ticket, the contents of that header can be copied to the front in order to create a summary.
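The header move described above could be sketched roughly as follows. This assumes the RT headers form a leading run of "name: value" lines; the set of header names is a guess (only "ticket" and "Subject" are named in these notes), and the function name is hypothetical.

```python
import re

# Assumed header names; adjust to whatever our RT integration really accepts.
RT_HEADER_NAMES = {"ticket", "subject", "target_version", "tags", "status"}
HEADER_RE = re.compile(r"^([A-Za-z_]+):\s*(.*)$")


def move_rt_headers(message):
    """Move leading RT headers to the end of a log message.

    If a Subject header is present, its value is copied to the front
    to serve as the summary line. Messages without leading headers
    are returned unchanged.
    """
    lines = message.strip("\n").split("\n")
    headers = []
    subject = None
    i = 0
    while i < len(lines):
        m = HEADER_RE.match(lines[i])
        if not m or m.group(1).lower() not in RT_HEADER_NAMES:
            break
        headers.append(lines[i])
        if m.group(1).lower() == "subject":
            subject = m.group(2).strip()
        i += 1
    if not headers:
        return message
    body = "\n".join(lines[i:]).strip("\n")
    parts = []
    if subject:
        parts.append(subject)
    if body:
        parts.append(body)
    parts.append("\n".join(headers))
    return "\n\n".join(parts) + "\n"
```

Messages with headers but no Subject header keep whatever first body line they already have as the summary, so they may still need manual attention.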
RT ticket numbers
A Subversion commit containing a "ticket: new" line creates a new RT ticket. Recently, such log messages have been rewritten to reference the newly created ticket number instead of "new". It may be worthwhile to rewrite old log messages as part of the conversion. (XXX but how to find the ticket numbers automatically?)
git-svn creates trailing "git-svn-id:" lines in log messages. Some conversion guides recommend using the --no-metadata option to omit these lines for tidiness, but this would make it harder to track down references to Subversion revision numbers in commit logs and RT tickets. So we should probably keep these, or at most rewrite them to include only the revision number.
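If we choose to reduce these lines to the revision number, the rewrite might look like the sketch below. It assumes the usual "git-svn-id: URL@REVISION UUID" layout of the line; the replacement trailer name "svn-revision" and the function name are made up for illustration.

```python
import re

# Matches a whole git-svn-id line and captures the revision number
# that follows the "@" at the end of the repository URL.
GIT_SVN_ID_RE = re.compile(r"^git-svn-id: \S+@(\d+) \S+$", re.MULTILINE)


def shorten_git_svn_id(message, trailer="svn-revision"):
    """Replace each git-svn-id line with a short revision-only trailer."""
    return GIT_SVN_ID_RE.sub(lambda m: f"{trailer}: r{m.group(1)}", message)
```

A function like this could serve as the body of a message filter run over the converted repository before it is made public. Messages without a git-svn-id line pass through unchanged.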
Log messages can be modified after conversion (but before the converted repository is made public) using git filter-branch, or the messages can be changed in the Subversion repository prior to conversion. Converting log messages using git filter-branch is inherently somewhat slower due to the need to recompute SHA-1 hashes on all subsequent revisions.
If we do any manual editing to create summary lines, we will want the process to be as efficient as possible due to the number of commits in our repository (20,000 or so). We could develop a simple tool which would allow a developer to edit a "git log" or "svn log" display, and then use the edited log to transform log messages in the pre-conversion or post-conversion repository. A tool to identify which revisions in a log need attention would also be helpful.
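As a starting point for the tool that identifies revisions needing attention, here is a rough sketch. It assumes the log has already been loaded as (revision, message) pairs; both function names are hypothetical, and the 50-character limit is taken from the guideline above.

```python
def summary_problems(message, limit=50):
    """Return a list of ways a log message breaks the summary-line convention."""
    lines = message.rstrip("\n").split("\n")
    problems = []
    if not lines[0].strip():
        problems.append("empty summary line")
    elif len(lines[0]) >= limit:
        problems.append("summary line too long")
    # A second line, if present, must be blank.
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line after summary")
    return problems


def revisions_needing_attention(log, limit=50):
    """Given (revision, message) pairs, return those with problems."""
    flagged = []
    for revision, message in log:
        problems = summary_problems(message, limit)
        if problems:
            flagged.append((revision, problems))
    return flagged
```

Running this before and after any automated header cleanup would show how many of the 20,000 or so messages still need manual editing.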
Content outside of standard Subversion layout
Our Subversion repository contains some content outside of the standard trunk/tags/branches layout. By default git-svn will ignore this content. Branches can be added to the git-svn conversion by specifying multiple -b options to git svn clone; alternatively, the svn repository could be preprocessed (via dump and load) to move particular subdirectories into the branches directory. Branches can be easily renamed after conversion. (XXX How does git-svn handle conflicts in branch names during conversion?)
The top-level tools directory contains local work on external testing tools (gssmonger and gsstest). It is likely best to ignore these and create separate repositories for them.
The top-level ChangeLogs directory contains copies of the ChangeLog files which were removed from the tree in r17893 (2006-04-11). It is likely best to ignore this as well. The history of these deleted files will be present in the converted repository, and it is easier to access historical versions of deleted ChangeLog files in git than in svn.
The top-level users directory contains semi-private branches for individual developers. Branches which may be worth including in the converted repository include:
- users/amb/referrals/trunk -- Client-side referrals (some notes are also present in the parent, but can probably be ignored in the conversion)
- users/coffman/gic_opt_ext -- Extensible gic_opt structures
- users/coffman/keyring -- Linux keyring ccache support
- users/coffman/pkinit -- PKINIT preauth module
- users/hartmans/pkinit -- Apple PKINIT
- users/hartmans/fast -- FAST negotiation support
- users/lhoward/aes-ccm -- AES-CCM cipher support (never merged due to issues with CCM and Kerberos)
- users/lhoward/authdata -- authdata pluggable interface
- users/lhoward/camellia-ccm -- Camellia-CCM cipher support (a predecessor of branches/camellia-ccm)
- users/lhoward/gssextras -- Miscellaneous GSSAPI extensions
- users/lhoward/gssextras-no-cqa -- A successor to the gssextras branch with one extension removed
- users/lhoward/heimmig -- HDB KDB module
- users/lhoward/iakerb-libkrb5-as-only -- krb5_init_creds_step and related APIs
- users/lhoward/iakerb-refonly -- IAKERB support, without full support for the TGS path
- users/lhoward/import-cred -- gss_krb5_import_cred support
- users/lhoward/lockout -- Account lockout support
- users/lhoward/lockout2 -- A successor to the lockout branch with the explicit lockout time attribute removed
- users/lhoward/moonshot-mechglue-fixes -- Miscellaneous mechglue changes useful for Moonshot
- users/lhoward/namingexts-mechglue -- gss_export_name_composite support
- users/lhoward/s4u -- S4U2Self (protocol transition) and S4U2Proxy (constrained delegation) support
- users/lhoward/s4u2proxy -- PAC-less S4U2Proxy support
- users/lhoward/sasl-gs2 -- gss_inquire_saslname_for_mech/gss_inquire_mech_for_saslname support
- users/lhoward/signedpath-naming-exts -- naming extensions support for signedpath authdata
- users/raeburn/branches/network-merge -- shared code for kadmind and KDC network loop
- users/raeburn/branches/syms -- restricted export lists for shared libraries
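Several of the paths above share a final component (for example, users/coffman/pkinit and users/hartmans/pkinit). If converted branch names are derived from that final component, which is an assumption about git-svn's behavior that we have not verified here, such paths would collide. A small sketch for finding the collisions ahead of time, with a hypothetical function name:

```python
from collections import defaultdict
import posixpath


def leaf_name_collisions(branch_paths):
    """Group Subversion branch paths by final component.

    Returns a dict mapping each shared final component to the list of
    paths that use it; components used by only one path are omitted.
    """
    by_leaf = defaultdict(list)
    for path in branch_paths:
        leaf = posixpath.basename(path.rstrip("/"))
        by_leaf[leaf].append(path)
    return {leaf: paths for leaf, paths in by_leaf.items() if len(paths) > 1}
```

Any colliding paths could be renamed in a preprocessed copy of the Subversion repository, or converted separately and renamed afterward.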
Branch and tag cleanup
Some of the branches in the repository are relics of the cvs-to-svn conversion and should be removed. These branch names begin with "unlabeled".
Tags in the converted repository will be expressed as branches with names beginning with "tags/". These should be converted to git tags. (XXX do any tags have commits on them, e.g. to change version numbers? If so, how should we handle them?)
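The two cleanup rules above can be expressed as a simple classification of converted branch names. This is only a sketch: the function name and the sample names in the usage note are hypothetical, and it does not attempt to answer the open question about tags that have commits on them.

```python
def plan_ref_cleanup(branch_names):
    """Sort converted branch names into delete, tag, and keep lists.

    Names beginning with "unlabeled" are marked for deletion. Names
    beginning with "tags/" are paired with the git tag name they
    should become. Everything else is kept as a branch.
    """
    plan = {"delete": [], "tag": [], "keep": []}
    for name in branch_names:
        if name.startswith("unlabeled"):
            plan["delete"].append(name)
        elif name.startswith("tags/"):
            plan["tag"].append((name, name[len("tags/"):]))
        else:
            plan["keep"].append(name)
    return plan
```

Given the names "trunk", "unlabeled-1.1.1", and "tags/some-release", the plan keeps the first, deletes the second, and maps the third to a tag named "some-release". The plan can be reviewed by hand before any refs are changed.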
- RT support (headers at end; what to do with "ticket: new"?)
- Push to github mirror
To support the "ticket: new" capability, we currently rewrite the log messages of the commits on the svn server. There is no easy way to have a server-side git hook rewrite the commits it receives without exposing some intermediate state. The post-receive hook could rewrite the commit messages, but it would have to move the pushed branch head to point at the rewritten commit stream, because the post-receive hook runs after the server updates the refs. "git push" doesn't automatically rebase from the server's state; the pushing client would have to have a hook to rebase its branches after this event. A second client that contacts the server between when the original pushed ref update occurs on the server and when the post-rewrite refs are exposed would also have to rebase, which could be inconvenient.
For ticket numbers, git notes are an alternative that some people have suggested. They have the disadvantage of not being immutable, and might not clone so well across repositories. They're also moderately new, and it's plausible that some older OS distributions of git won't support them.
The existing github repository appears to have some revisions with invalid metadata which cause git filter-branch -- --all to abort. These arise from Subversion revisions like r8987 which were manufactured by cvs2svn to create tags, and do not have useful author or date fields. If we can identify all such revisions, we can work around the problem by editing the revision properties on the Subversion repository prior to conversion. Alternatively, we can remove those revisions from the git repository as they should all be leaves for very old tags.
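One possible way to identify such revisions is to scan the XML form of the Subversion log (svn log --xml) for entries lacking an author or date. The sketch below assumes that a missing revision property shows up as an absent or empty element in that output; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def revisions_missing_metadata(svn_log_xml):
    """Return revision numbers whose log entry lacks an author or date.

    svn_log_xml is the text (or bytes) of an XML Subversion log.
    """
    root = ET.fromstring(svn_log_xml)
    missing = []
    for entry in root.iter("logentry"):
        # findtext returns None when the element is absent.
        author = (entry.findtext("author") or "").strip()
        date = (entry.findtext("date") or "").strip()
        if not author or not date:
            missing.append(int(entry.get("revision")))
    return missing
```

The resulting list could then drive either fix: setting the revision properties in Subversion before conversion, or locating the corresponding leaf commits to remove from the git repository.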