Fraser Speirs Cocoa and Photos

Posted
19 July 2007 @ 10pm

Tagged
Programming

A Subversion User Looks at Git

Subversion was, until yesterday, the only SCM system that I understood well enough to use. Today, I feel I can add Git to that list. The disclaimer on that which follows is that it’s mostly an understanding gained from reading documentation. Git appears to have an excellent documentation set but, if those documents mislead in some way, I have likely been misled too. Having said this, I’m not going to couch this in weasel words in order to appear circumspect. This is my current understanding of Git and its pros and cons. I may be wrong.

Basic Git Architecture

From an architectural perspective, Git is gloriously simple. There are four essential objects: blobs, trees, commits and tags.

A blob is strictly a piece of file content. I believe that blobs are generally segmented along file boundaries, but I haven’t yet worked out if blobs are also used to track portions of a file. Blobs are named by the SHA1 hash of their contents. This can lead to a performance problem if your files are large - as a pathological case, I created a git repository of several 500MB AIFF files - it took rather a long time and ate all my RAM. That’s hardly the normal case, however.
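As a rough illustration of content addressing, here is a minimal Python sketch of how a blob's name can be computed. It assumes Git's convention of hashing a short header (the word "blob", the byte length and a NUL byte) followed by the content, rather than the raw content alone. The function name git_blob_id is made up for this example and is not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA1 name Git would give a blob holding `content`."""
    # Git hashes a small header plus the content, not the content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Identical content always yields the same name, whatever the file is called.
    print(git_blob_id(b"hello world\n"))
```

Because the whole content has to pass through the hash, naming a 500MB file means reading all 500MB, which fits the slowdown described above.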

A tree assembles blobs and other trees into a hierarchical structure, matching the on-disk hierarchy of your files. A tree is essentially a mapping between a blob’s name (i.e. its SHA1) and file name. Trees are stand-alone objects in the history of a project - they don’t contain any information about where they came from.

The commit object refers to a tree - specifically, the state of the tree after that commit is applied - and contains some information about who committed and what was done. It is the commit object, rather than its related tree, which connects the commit to its predecessor (or predecessors, in the case of a merge).

The tag object simply collects the SHA1 sum of an object, the object’s type and a symbolic name. My understanding is that you could, in principle, tag a blob, a commit or a tree. I’m not completely certain whether one should tag commits or trees, but I suspect commits would be the correct object. It’s not clear that one can reach a commit from a given tree object.
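The relationships between these objects can be sketched as a toy model in Python. This is a simplification under stated assumptions, not Git's storage format: the class and field names are invented for illustration, and short placeholder strings stand in for real SHA1 names. It shows that history hangs off commits through their parent links, while a tree carries no link back to any commit.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class Tree:
    # Pairs of (file name, name of a blob or sub-tree).
    entries: Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class Commit:
    tree: str                 # name of the tree as it stands after this commit
    parents: Tuple[str, ...]  # zero, one, or (for a merge) several predecessors
    author: str
    message: str


@dataclass(frozen=True)
class Tag:
    target: str       # name of the tagged object
    target_type: str  # "commit", "tree" or "blob"
    name: str         # symbolic name, e.g. "v1.0"


def history(store: Dict[str, Commit], start: str) -> Iterator[str]:
    """Yield commit names reachable from `start`, following parent links."""
    seen, pending = set(), [start]
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        yield name
        pending.extend(store[name].parents)


if __name__ == "__main__":
    # Placeholder names stand in for real SHA1 hashes.
    store = {
        "c1": Commit(tree="t1", parents=(), author="me", message="initial"),
        "c2": Commit(tree="t2", parents=("c1",), author="me", message="second"),
    }
    print(list(history(store, "c2")))
```

In this model, a tag that targets a commit can reach the whole history, whereas a tag that targets a tree reaches only that one snapshot. That supports the suspicion above that commits are the better thing to tag.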

One nice feature of Git is that it allows you to undo or change a commit after it has been made. Here’s one example of where it’s super useful: I work between a desktop and a laptop machine. Using Subversion, when I have to move machines, I commit my work in progress and then update the machine I’m moving onto. This is generally fine, but it means there are a lot of commits in the repository that represent points at which I wouldn’t normally commit code - where things are broken, incomplete or don’t compile. With Git and some care, you can commit your work in progress, pull the changes to another machine and then undo the last commit.
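Undoing a commit can be pictured as moving a branch back to the parent of the commit it points at. The Python sketch below is a toy model under that assumption, not Git's implementation: the names parents and undo_last_commit are hypothetical, commits have a single parent, and the working copy is not modelled at all. In Git itself, git reset is the command that moves a branch in this way.

```python
from typing import Dict, Optional

# Each commit name maps to its single parent (None for the first commit).
# Placeholder names stand in for real SHA1 hashes.
parents: Dict[str, Optional[str]] = {"c1": None, "c2": "c1", "wip": "c2"}


def undo_last_commit(branches: Dict[str, str], branch: str) -> str:
    """Move `branch` back to the parent of the commit it points at.

    Returns the name of the commit that was dropped from the branch.
    """
    dropped = branches[branch]
    parent = parents[dropped]
    if parent is None:
        raise ValueError("cannot undo the first commit in this toy model")
    branches[branch] = parent
    return dropped


if __name__ == "__main__":
    branches = {"master": "wip"}  # the work-in-progress commit is the tip
    undo_last_commit(branches, "master")
    print(branches)
```

The dropped commit is no longer part of the branch's history, which is the difference from committing a second change that reverses the first.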

The Directory Index

One concept that exists in Git that doesn’t exist in Subversion in quite the same way is the notion of the Directory Cache. The directory cache is a file which describes a tree, although the tree which it describes may not exist in the repository yet. As you work, you add changes to this cache and when you commit, the tree described by the directory cache is written to the repository with an associated commit object. The key line from the documentation here is: “creating a new tree always involves a controlled modification of the index file” (ref: core-intro.txt).

The index file is not so very different in practice from Subversion’s idea of having added files that are not yet committed. The index file is Git’s representation of the same.

Having said that, git’s notion of “adding” a file is slightly different from Subversion’s. In SVN, you’re telling SVN to “start tracking file X”. In Git, you’re saying “take a snapshot of the content of file X and store it in the index for the next commit”. As a result, you have to - at least in principle - perform a “git add filex.c” every time you change filex.c. There is, however, some syntactic sugar in the form of “git commit -a” which adds all the modifications to known files and commits in one step.

This is pretty powerful: how often have you done some work on a feature and cleaned up some headers as you went by? When you’re done, you have to look at each file you’ve changed and perhaps do a number of commits to specific files. In git, you can just decide not to “git add” those clean-ups to the index until after you’ve committed the meat of your work.
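The snapshot behaviour of the index can be sketched with a toy model in Python. This is an illustration under simplifying assumptions, not how Git stores anything: ToyRepo and its attributes are invented names, file contents are plain strings, and each commit is just a copy of the index.

```python
from typing import Dict, List


class ToyRepo:
    def __init__(self) -> None:
        self.working: Dict[str, str] = {}        # files as they are on disk
        self.index: Dict[str, str] = {}          # snapshot staged for the next commit
        self.commits: List[Dict[str, str]] = []  # each commit is a full snapshot

    def add(self, path: str) -> None:
        # Copy the file's *current* content into the index.
        self.index[path] = self.working[path]

    def commit(self) -> Dict[str, str]:
        # The commit records the index, not the working copy.
        snapshot = dict(self.index)
        self.commits.append(snapshot)
        return snapshot


if __name__ == "__main__":
    repo = ToyRepo()
    repo.working["filex.c"] = "feature"
    repo.working["header.h"] = "tidied header"
    repo.add("filex.c")
    repo.working["filex.c"] = "feature plus later edit"  # not re-added
    print(repo.commit())
```

The commit made here holds only filex.c, with the content it had when it was added. The header clean-up and the later edit stay in the working copy until they are added and committed separately.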

Branching and Merging

Branches, and merges between those branches, are a central concept in Git. Given that this was developed to track the Linux kernel, this is hardly surprising.

Given that NIB files are not mergeable with any common merge algorithm, it’s not clear that this style of working would be terribly good for Cocoa development. The documentation does not say a whole lot about what happens to binary files. It’s not that Git is unsuitable for handling NIBs - far from it. I just observe that the approach of frequently repeated branch-and-merge operations rather depends on a high probability of clean automatic merges to be bearable. The fact that a git merge will automatically commit in the absence of conflicts suggests that this expectation underlies the design.

Having said that, it’s no easier to merge a NIB in Subversion. It’s just that merging isn’t so commonplace an operation in SVN. The correct solution, of course, is for Apple to make NIB files more easily mergeable.
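Why unmergeable files hurt a branch-heavy workflow can be sketched with a three-way merge that treats every file as an opaque blob. This is a simplified model in Python, not Git's merge algorithm (which also merges within text files). The function name merge_whole_files and the file names in the usage note are made up for illustration.

```python
from typing import Dict, Set, Tuple

Snapshot = Dict[str, bytes]  # file name -> content


def merge_whole_files(
    base: Snapshot, ours: Snapshot, theirs: Snapshot
) -> Tuple[Snapshot, Set[str]]:
    """Three-way merge that treats every file as an opaque blob."""
    merged: Snapshot = {}
    conflicts: Set[str] = set()
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o           # both sides agree (or both deleted it)
        elif o == b:
            result = t           # only their side changed the file
        elif t == b:
            result = o           # only our side changed the file
        else:
            conflicts.add(path)  # both changed it differently: needs a human
            result = o           # keep our version in the meantime
        if result is not None:
            merged[path] = result
    return merged, conflicts
```

With a base of Main.nib and main.m, if both branches edit Main.nib while only one edits main.m, the merge takes the main.m change automatically and reports Main.nib as a conflict. Any NIB touched on both branches needs manual resolution, so the automatic commit on a clean merge happens less often.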

Repository Layout

One thing that I already love about Git is that it does not depend on putting a dot-directory in every directory in a working copy (recall that every Git working copy is also a repository). There’s one .git directory at the root of the repository and absolutely nothing else. Anyone who has had to check RTFd files into Subversion and then edit those files with TextEdit will be cheering right about now. For those who haven’t, understand that RTFd files are actually bundles, and bundles are directories. Thus, Subversion adds a .svn directory inside your RTFd file. When TextEdit saves this file, the .svn directory is lost and the file appears disconnected from its history.

For this fact alone, I’m looking at the implications of switching to Git.

What’s Missing

Currently, the only thing from Subversion that appears to me to be obviously missing in Git is the concept of svn:externals. I use externals a fair bit in my SVN projects, and I’m not yet certain how one could replicate them in Git.

You can add so-called “remote tracking branches” in Git, in which your repository tracks a branch in the repository you originally created yours from or, indeed, arbitrary branches from arbitrary repositories. This lets you switch your working copy to another branch from somewhere else, but it doesn’t let you attach an arbitrary tree to an arbitrary point in your tree.

I suspect the approach might be to import source from some remote repository, create a tracking branch and then merge between the tracking branch and some subdirectory of your working copy. I have not yet seen any documentation on how to do this, nor on how to do it if the other repository is not using Git but, say, Subversion.

Conclusions

Git’s rethink of the entire content management problem enables some powerful new capabilities. I write this whilst on holiday with very sporadic net access. I’ve been coding away in my Subversion-managed projects, but unable to commit in sensible chunks without internet access. With Git, there would be no problem whatsoever.

Because few operations depend on the network, Git’s performance is excellent for most common operations and cases.

Git is confusing and alien to someone raised on CVS and Subversion, that much is certain. I feel like I understand the component parts of Git, but that I’m not necessarily entirely understanding their implications and interactions just yet. It also feels like Git gives you slightly more rope with which to hang yourself, but I do recall feeling that way about Subversion when I started using it. With SVN, I’ve come to trust that my usual workflow and conventions don’t produce broken results and, when I’m doing something new there are good docs to back me up. I suspect I could reach the same position with Git quite easily.

Finally, I continue to ask myself whether using Git would really confer serious advantages to a (usually) solo Cocoa developer. The answer is that I’m currently not sure, for the following reasons:

  • I rarely have several branches in active development at any one time. Even if I have multiple SVN branches, I’m usually only working on one at a time.
  • Git offers nothing new to the problem of merging NIBs.
  • Git’s optimism about clean automatic merges is somewhat less justified for Cocoa projects than for, say, the Linux kernel.
  • I don’t often collaborate with large numbers of people on projects.
  • Git’s pretty confusing, even after reading the docs twice.
  • Everyone else uses Subversion (see the point about svn:externals).

Where does Git provide compelling improvements?

  • Much cleaner handling of bundle files.
  • The ability to revert a commit is something I’ve, er, occasionally had reason to wish existed in SVN :-)
  • Working disconnected on a laptop no longer requires either (a) a gigantic checkin once you get home or (b) picking apart your changes to commit separate features.
  • Performance is a feature.
  • The ability to explicitly define the contents of a commit in a structure other than the current state of the working copy is pretty nice.

There will certainly be more on this as time goes on. I’ve been hearing too much buzz about Git from people whom I respect to ignore it. I don’t hear anything about arch, monotone, BitKeeper, codeville, SVK or darcs from anywhere except the nerdiest of SCM nerds.


7 Comments

Posted by
Travis Risner
21 July 2007 @ 1am

Hi Fraser,

You have raised some very thought-provoking points. Git has questioned some of the assumptions of other SCMs and has tried to be less tied to those assumptions. Hence the flexibility that you found.

Your comments have not addressed some questions that occurred to me. I understand that you are very new to Git and may not have ready answers. I was wondering about source code. Does Git track the changes line by line? Can it give you a diff of what changed? Is there a way of drawing a line in the sand and saying that “exactly this was the way it was, as of this date/tag/commit?” even if a blob has been revoked later? Does Git track changes to directories or moved/renamed modules?

Are you aware that there are a number of SVN GUI programs available that make it easy to commit an arbitrary list of modules separately? Another advantage is that you can compare any two versions of a source module and have the differences highlighted. A set of changes can be revoked without branching, as well.

I ask these questions because, not only do I want to know more about Git, I hope these questions will help you to decide which SCM best suits your needs.

Thanks,
Travis


Posted by
Francesc Esplugas
21 July 2007 @ 9am

The reasons you describe are the same ones that made me decide to move from Subversion to Mercurial.


Posted by
Diego Zamboni
21 July 2007 @ 9pm

Thanks for the summary. I’ve been wanting to look at git for some time (also being a heavy subversion user), and your post gives me some idea of what to expect.

Do you know if there’s any way of using git from Xcode? Mostly curiosity, since even with subversion I almost always use it from the command line and not from within Xcode.


Posted by
Enno Ruijters
22 July 2007 @ 1am

Regarding Travis’s questions:

The fundamental difference between git and other SCMs is that git tracks the contents of files rather than the file itself. For every commit, it takes a snapshot of all the files that you added/changed and keeps a copy of that entire file. So there is no way to ‘revoke’ a file afterwards. Deleting it just keeps it out of future commits. (So yes, checking out a commit is ‘exactly as it was’).

Git can give you a diff of what changed since the last commit or between any two commits, but it does so by comparing the snapshots it took. This, like the SHA1 hashes, can become quite slow for large files but for source code that’s not usually a problem.

Git doesn’t track directories as such (although there is currently a discussion among the developers and it might be added in the future), but it does keep track of their names as part of the commit, so as long as there is at least one file in a directory (or a subdirectory of it), renames are handled.

Only recently was support for something like modules added. Git calls it subprojects I believe but I haven’t used it yet so I can’t give you an answer there.

For some more information on this, Linus recently gave a talk at Google about git that covers some of the questions you asked. You can find it at http://www.youtube.com/watch?v=4XpnKHJAok8.


Posted by
Mark Phippard
22 July 2007 @ 2pm

Great post. I have not looked at Git much, but thought I’d point out a couple of things about Subversion from your post.

It is not particularly difficult to revert a commit in Subversion. See http://blogs.open.collab.net/svn/2007/07/second-chances-.html as an example.

I did not understand the point you were making about adding files after you commit and the advantage that Git brings. Subversion can certainly do the same thing. As someone else pointed out, Subversion also has a number of great GUI clients that make this easier.

One thing that I think I read into your article that you did not specifically say, is that it sounds like with Git when you add a file it sort of freezes the content it added. So if you were to make future changes to the file and then commit, the original content would get committed, not the latest changes. This is different than Subversion and could potentially be useful.

Another thing to come back to was your comment about reversing commits. I thought one of the points of a distributed version control system like Git was that you could and would do lots of these checkpoint commits. Therefore you get more version control features during the development process. I guess what I am saying, is that I do not understand why you would want to reverse these commits. It seems like the point of Git is to empower you to make these commits without impacting others.

Anyway, thanks for the post.

Mark


Posted by
fraserspeirs
22 July 2007 @ 6pm

Mark,

The link you referred to about reverting commits in Subversion says:

“The only way to truly remove something from a repository is to dump the repository to a file, carefully remove the parts you do not want from the dump file, and then reload the repository.”

Certainly, you can make a further commit in Subversion that undoes the effect of your previous commit, but you can’t easily say “throw away the last commit I made” as you can in Git.

The main attraction of removable commits is when I want to hop from my desktop to my laptop. I can commit whatever I was doing on the desktop - however broken it was - then move to the laptop, pull the changes, merge and remove the last commit.


Posted by
David Roussel
23 July 2007 @ 8am

Similarly to you, I decided to read up on distributed SCM while on holiday.

I looked at Git a bit, but I chose to look at Mercurial in more depth. Git and Mercurial are very similar, but these are Mercurial’s more attractive features:
- Mercurial is simpler for the CVS/SVN user
- Far fewer commands to learn
- No need to pack your repository
- Works on Linux, Mac OS X, and Windows
- Has Mercurial Queues (MQ) extension.

Be sure to read up on MQ: http://hgbook.red-bean.com/hgbookch12.html

The model I’ve come up with for contributing to projects that use SVN publicly is to use hgsvn to create a Mercurial mirror of an SVN repo. Now I can work offline, or even just have faster access to all history.
Then use MQ to develop a patch set (say one patch for the logging system, one patch for the documentation, and another for the data access layer). Then I can work on the patches while keeping up to date (by using hgsvn) and then finally manually apply the patches to SVN.

Works a treat! No need to get *insert favourite project here* to convert to a distributed SCM.

David