Notes on converting from Subversion to Git
Most likely, bulk conversion of the Subversion repository will be done with git svn clone, as was done for our krb5-anonsvn repository on GitHub.
Because of the way git uses SHA-1 hashes, it is desirable to make the conversion as clean as possible now, as doing any cleanup later will be disruptive to downstream repositories. Because of our project's long history and historical lack of consistent code documentation, our version control history is a valuable investigative tool, and effort applied to cleaning up our history during conversion will pay dividends in the future.
Log message cleanup
Subversion commit authors are simply repository-local usernames. git commit authors contain a real name and an email address. git svn clone supports a map file mapping Subversion authors to git authors. We should prepare a map file containing the names of all historical Subversion authors, using @mit.edu addresses for all usernames.
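A minimal sketch of generating such a map file follows. It assumes we have already collected a table of usernames and real names by hand; the function name and the sample entries are hypothetical. Each output line uses the "username = Real Name <address>" form that git svn reads through its --authors-file option.

```python
def format_authors_map(real_names, domain="mit.edu"):
    """Return the text of a git-svn authors file.

    real_names maps each Subversion username to the author's real name.
    The email address is built as username@domain, which follows the
    plan of using @mit.edu addresses for all usernames.
    """
    lines = []
    for username in sorted(real_names):
        name = real_names[username]
        lines.append(f"{username} = {name} <{username}@{domain}>")
    return "\n".join(lines) + "\n"


# Hypothetical entries, for illustration only.
sample = {"jdoe": "Jane Doe", "asmith": "Alex Smith"}
authors_text = format_authors_map(sample)
```

The resulting text would be saved to a file and passed to git svn clone with --authors-file. The list of usernames itself can be extracted from svn log output.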
In git repositories, it is common for the author and date of a commit to reflect the original authorship even if the author doesn't have write access to the repository, and a Signed-off-by: line to indicate who pushed the change. This practice was not possible in Subversion, so our Subversion log metadata reflects the actual committer. It would probably be too much effort to remedy this mismatch at conversion time.
It is recommended practice in git to begin each commit message with a summary line of less than 50 characters, followed by a blank line. Some tools, such as git log --oneline, assume this practice. It is probably worth some amount of effort to make most of our converted log messages follow this practice.
Many of our commit messages begin with RT metadata headers. These should be moved to the end of the message so they do not appear in summary lines. If a message contains a Subject: header for the new ticket, the contents of that header can be copied to the front in order to create a summary.
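The header move described above could be sketched as follows. This is only an illustration: the function name is hypothetical, and the set of recognized header names is an assumption limited to the two headers named in these notes (ticket and subject). The real list of RT headers would need to be filled in before use.

```python
import re

# Assumption: only the headers named in these notes are recognized.
# Extend this tuple with the full set of RT headers before real use.
RT_HEADER_NAMES = ("ticket", "subject")

HEADER_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_-]*):\s*(.*)$")


def parse_header(line, names=RT_HEADER_NAMES):
    """Return (name, value) if line is a recognized RT header, else None."""
    m = HEADER_RE.match(line)
    if m and m.group(1).lower() in names:
        return m.group(1).lower(), m.group(2)
    return None


def move_rt_headers(message):
    """Move leading RT headers to the end of a commit message.

    If a Subject header is present, its value is also copied to the
    front of the message to serve as the summary line.
    """
    lines = message.strip("\n").split("\n")
    headers = []
    i = 0
    while i < len(lines) and parse_header(lines[i]):
        headers.append(lines[i])
        i += 1
    if not headers:
        return message  # nothing to move
    # Skip blank lines between the headers and the body.
    while i < len(lines) and not lines[i].strip():
        i += 1
    body = lines[i:]
    subject = None
    for line in headers:
        name, value = parse_header(line)
        if name == "subject" and value:
            subject = value
            break
    out = []
    if subject:
        out.append(subject)
        if body:
            out.append("")
    out.extend(body)
    if out:
        out.append("")
    out.extend(headers)
    return "\n".join(out) + "\n"
```

Messages without leading headers are returned unchanged, so the function can be applied to every commit.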
RT ticket numbers
A Subversion commit containing a "ticket: new" line creates a new RT ticket. Recently, we began rewriting such log messages to reference the newly created ticket number instead of "new". It may be worthwhile to rewrite old log messages as part of the conversion. (XXX but how to find the ticket numbers automatically? There are 760 of these, plus 294 pullups, so doing it by hand would be prohibitive.)
git-svn creates trailing "git-svn-id:" lines in log messages. Some conversion guides recommend using the --no-metadata option to omit these lines for tidiness, but this would make it harder to track down references to Subversion revision numbers in commit logs and RT tickets. So we should probably keep these, or at most rewrite them to include only the revision number.
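If we choose to shorten the git-svn-id lines, the rewrite could look something like the sketch below. It assumes the usual "git-svn-id: URL@REVISION UUID" layout of these lines. The function name and the replacement label "svn-revision" are hypothetical; any label we settle on would work.

```python
import re

# Matches a whole "git-svn-id: <url>@<revision> <uuid>" line.
SVN_ID_RE = re.compile(r"^git-svn-id: \S+@(\d+) \S+$", re.MULTILINE)


def shorten_svn_id(message, label="svn-revision"):
    """Replace each git-svn-id line with a line holding only the revision."""
    return SVN_ID_RE.sub(lambda m: f"{label}: r{m.group(1)}", message)
```

A function like this could serve as the message filter in a git filter-branch run, reading the old message and writing the new one.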
Log messages can be modified after conversion (but before the converted repository is made public) using git filter-branch, or the messages can be changed in the Subversion repository prior to conversion. Converting log messages using git filter-branch is inherently somewhat slower due to the need to recompute SHA-1 hashes on all subsequent revisions.
If we do any manual editing to create summary lines, we will want the process to be as efficient as possible due to the number of commits in our repository (20,000 or so). We could develop a simple tool which would allow a developer to edit a "git log" or "svn log" display, and then use the edited log to transform log messages in the pre-conversion or post-conversion repository. A tool to identify which revisions in a log need attention would also be helpful.
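As a starting point for the tool that identifies revisions needing attention, here is a rough sketch. It assumes the log messages have already been extracted into a mapping from revision identifier to message text; the function names are hypothetical. It flags messages whose first line is empty, is not shorter than 50 characters, or is not followed by a blank line.

```python
def summary_problems(message, limit=50):
    """Return a list of problems with the summary line of one message."""
    lines = message.rstrip("\n").split("\n")
    summary = lines[0]
    problems = []
    if not summary.strip():
        problems.append("empty summary")
    elif len(summary) >= limit:
        # The convention described above is fewer than 50 characters.
        problems.append(f"summary is {len(summary)} characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line after summary")
    return problems


def revisions_needing_attention(messages, limit=50):
    """Map each flagged revision to its list of problems.

    messages maps a revision identifier to its log message.
    """
    flagged = {}
    for revision, message in messages.items():
        problems = summary_problems(message, limit)
        if problems:
            flagged[revision] = problems
    return flagged
```

Running this over the full history would show how many of the 20,000 or so commits actually need manual work.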
Content outside of standard Subversion layout
Our Subversion repository contains some content outside of the standard trunk/tags/branches layout. By default git-svn will ignore this content. Branches can be added to the git-svn conversion by specifying multiple -b options to git svn clone; alternatively, the svn repository could be preprocessed (via dump and load) to move particular subdirectories into the branches directory. Although git-svn does not gracefully handle branch name collisions by default, we can work around this problem by editing .git/config before the first fetch, as documented in the CAVEATS section of the git-svn man page. Branches can be easily renamed after conversion.
The top-level tools directory contains local work on external testing tools (gssmonger and gsstest). It is likely best to ignore these and create separate repositories for them.
The top-level ChangeLogs directory contains copies of the ChangeLog files which were removed from the tree in r17893 (2006-04-11). It is likely best to ignore this as well. The history of these deleted files will be present in the converted repository, and it is easier to access historical versions of deleted ChangeLog files in git than in svn.
The top-level users directory contains semi-private branches for individual developers. Branches which may be worth including in the converted repository include:
- users/amb/referrals/trunk -- Client-side referrals (some notes are also present in the parent, but can probably be ignored in the conversion)
- users/coffman/gic_opt_ext -- Extensible gic_opt structures
- users/coffman/keyring -- Linux keyring ccache support
- users/coffman/pkinit -- PKINIT preauth module
- users/hartmans/fast -- FAST support
- users/hartmans/fast-negotiate -- FAST negotiation support
- users/lhoward/aes-ccm -- AES-CCM cipher support (never merged due to issues with CCM and Kerberos)
- users/lhoward/authdata -- authdata pluggable interface
- users/lhoward/camellia-ccm -- Camellia-CCM cipher support (a predecessor of branches/camellia-ccm)
- users/lhoward/gssextras -- Miscellaneous GSSAPI extensions
- users/lhoward/gssextras-no-cqa -- A successor to the gssextras branch with one extension removed
- users/lhoward/heimmig -- HDB KDB module
- users/lhoward/iakerb-libkrb5-as-only -- krb5_init_creds_step and related APIs
- users/lhoward/iakerb-refonly -- IAKERB support, without full support for the TGS path
- users/lhoward/import-cred -- gss_krb5_import_cred support
- users/lhoward/lockout -- Account lockout support
- users/lhoward/lockout2 -- A successor to the lockout branch with the explicit lockout time attribute removed
- users/lhoward/moonshot-mechglue-fixes -- Miscellaneous mechglue changes useful for Moonshot
- users/lhoward/namingexts-mechglue -- gss_export_name_composite support
- users/lhoward/s4u -- S4U2Self (protocol transition) and S4U2Proxy (constrained delegation) support
- users/lhoward/s4u2proxy -- PAC-less S4U2Proxy support
- users/lhoward/sasl-gs2 -- gss_inquire_saslname_for_mech/gss_inquire_mech_for_saslname support
- users/lhoward/signedpath-naming-exts -- naming extensions support for signedpath authdata
- users/raeburn/branches/network-merge -- shared code for kadmind and KDC network loop
- users/raeburn/branches/syms -- restricted export lists for shared libraries
Branch and tag cleanup
Some of the branches in the repository are relics of the cvs-to-svn conversion and should be removed. These branch names begin with "unlabeled".
Tags in the converted repository will be expressed as branches with names beginning with "tags/". These should be converted to git tags.
Some branches in /branches or the above list are not correctly copied from trunk:
- branches/referrals (copied from users/amb/referrals which is a parent of a trunk copy)
- users/raeburn/branches/syms (copied from trunk/src)
- users/raeburn/branches/plugin (copied from trunk/src)
The referrals branch can probably be fixed using git filter-branch. git-svn does not behave well when it sees a branch copied from trunk/src (it fetches another copy of the history of src and places it on the branch), so the branches under users/raeburn will probably need to be omitted.
For better readability of "git branch" and "git tag -l" output, we probably want to remove branches and tags which aren't likely to be useful--very old branches and tags, tags intended for use as anchors for CVS merges, and KfM branches and tags (which are very numerous). If it turns out they are needed, all branches and tags will still be available in the frozen Subversion repository.
Our release branches and tags have names like krb5-X-Y, because CVS did not allow periods in branch names. We can rename these to krb5-X.Y during conversion. We can also rename the old 1.0 branches and tags (which have names like V1_0_BRANCH) to match the naming scheme of later branches.
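The krb5-X-Y to krb5-X.Y renaming could be computed with a small helper like the one below. This is a sketch only: the function name is hypothetical, and it assumes the version digits immediately follow the "krb5-" prefix. Names that do not match, including the old V1_0_BRANCH style, are left unchanged and would need separate handling.

```python
import re

# "krb5-" followed by hyphen-separated numbers, then any suffix.
RELEASE_RE = re.compile(r"krb5-(\d+(?:-\d+)*)(.*)")


def rename_release_ref(name):
    """Convert a krb5-X-Y style name to krb5-X.Y, keeping any suffix."""
    m = RELEASE_RE.fullmatch(name)
    if not m:
        return name  # not a release-style name; leave it alone
    version, suffix = m.groups()
    return "krb5-" + version.replace("-", ".") + suffix
```

Only the hyphens between version numbers are changed, so the hyphen after "krb5" and any hyphen before a non-numeric suffix are preserved.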
Commit hooks
Our current Subversion hooks perform the following tasks:
- Check whitespace before accepting a commit
- Translate "ticket: new" to "ticket: NNNN" in a commit log message
- Send email (usually with a diff) to email@example.com for each commit
- Update a copy of the repository in the krbdev locker (used by anonsvn.mit.edu)
- Send the commit log through rt-cvsgate on krbdev.mit.edu to process RT headers
We will want to perform similar tasks for the git repository, although the "ticket: new" translation may be impractical (see below). In addition, we may want to add checks to ensure that committers are recorded in commit logs when they are not the original authors of the commits (XXX investigate best practices for this).
rt-cvsgate expects RT headers to be at the beginning of the commit log. We should modify it to accept headers at the end of the message as well.
RT "ticket: new" support
To support the "ticket: new" capability, we currently rewrite the log messages of the commits on the svn server. There is no easy way to have a server-side git hook rewrite the commits it receives without exposing some intermediate state. The post-receive hook could rewrite the commit messages, but it would have to move the pushed branch head to point at the rewritten commit stream, because the post-receive hook runs after the server updates the refs. "git push" doesn't automatically rebase from the server's state; the pushing client would have to have a hook to rebase its branches after this event. A second client that contacts the server between when the original pushed ref update occurs on the server and when the post-rewrite refs are exposed would also have to rebase, which could be inconvenient.
For ticket numbers, git notes are an alternative that some people have suggested. They have the disadvantage of being mutable, and they are not fetched by a default clone, so they might not propagate well across repositories. They're also relatively new, and it's plausible that the git packages in some older OS distributions won't support them.
A simple option is to make a script for developers to run to allocate a ticket (out of the krbdev locker on Athena, or copied to the developer's homedir). The script would output a ticket number which the developer would put in the "ticket:" header of the commit log.
Scripts using svn
- util/mkrel uses svn to export a release tree.
- util/find-missing-eol-prop and util/fix-eol-prop use svn, but should be unnecessary after conversion to git.
- The nightly Coverity build script uses svn to maintain a copy of the trunk.
- The buildbot configuration uses svn, but can be converted to use git.
- Cron jobs use svn to generate snapshots, but could be consolidated into the buildbot setup.
- The cron job to update GitHub can be mostly superseded by a git hook, but there should still be a cron job in case there is a transient failure to push to GitHub.
The existing GitHub repository appears to have some revisions with invalid metadata which cause git filter-branch -- --all to abort. These arise from Subversion revisions like r8987 which were manufactured by cvs2svn to create tags, and do not have useful author or date fields. If we can identify all such revisions, we can work around the problem by editing the revision properties on the Subversion repository prior to conversion. Alternatively, we can remove those revisions from the git repository as they should all be leaves for very old tags.
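One possible way to identify such revisions is to scan the output of svn log --xml for entries that lack an author or date. The sketch below assumes the usual layout of that output (logentry elements with a revision attribute and optional author and date children). The function name and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def revisions_missing_metadata(svn_log_xml):
    """Return revision numbers whose author or date is missing or empty."""
    root = ET.fromstring(svn_log_xml)
    missing = []
    for entry in root.iter("logentry"):
        author = entry.findtext("author")
        date = entry.findtext("date")
        if not (author and author.strip()) or not (date and date.strip()):
            missing.append(int(entry.get("revision")))
    return missing


# Made-up sample input, for illustration only.
SAMPLE = """<log>
<logentry revision="1"><author>jdoe</author>
<date>2000-01-01T00:00:00.000000Z</date><msg>first</msg></logentry>
<logentry revision="2"><msg>tag created by conversion tool</msg></logentry>
</log>"""
```

Whether every problematic revision shows up this way is untested; the list it produces should be checked against the revisions that actually make git filter-branch abort.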