Shipping quality code with git

by Brenton Leanhardt

Git is a program for Source Code Management (SCM) whose complexity has been blown out of proportion. This may be due to the fact that early on it was primarily used by Linux kernel hackers who, needless to say, do not represent most users of SCM tools. Regardless of its past, today the UI is quite simple and there are only a handful of techniques a user needs to manage their code base with git–in ways that are nearly impossible to do with the mainstream alternatives. These techniques, which are mentioned in the order of their suggested usage, focus on improving the overall quality of the code base throughout the life of a project.

Good patches

While developers have been managing changes to code for decades, this task is not quite as simple as it sounds. These changes, or patches, range from simple one-line fixes to features measured in hundreds of lines. However, there is one goal that all good patches share, no matter their content. The objective is to introduce positive cohesive change to the code base. In practice, this means that unrelated changes should be contained in their own patches. As projects grow there is a need to group patches into cohesive sets for both organization and maintenance.

This article is as much about using git effectively as it is about creating quality patches. Before we go any further, let’s explain the need for this coupling.

Métro, Boulot, Codo

As much as project managers (or maintainers) like to believe their releases are well-planned and will go out on schedule, the reality is that changes in priorities, business needs, and staffing often disrupt the best intentions. A disciplined approach to submitting changes goes a long way toward allowing projects to react with agility to these types of challenges.

While it may not be apparent to developers in the trenches, a bit of discipline gives those in charge of integrating and releasing code the confidence to add or remove features–even if they do not have an extremely low-level understanding of the implementation. Most people tasked with releasing projects are not nearly as concerned about particular files; they are more worried about entire trees and changes over time. They want to ship feature X and yank feature Y with accuracy.

Sadly, many of the mainstream tools for managing source code are poorly set up for these kinds of tasks, so in the field the tasks are simply ignored. This can be seen in any project where a large majority of the commit messages provide vague information like “checking in my stuff”.

Let’s observe the life cycle of a typical nice-to-have patch:

  1. Developer has an itch for a new feature so she downloads the project’s source
  2. Several changes are made–some good, some bad
  3. The feature is then tested
  4. A patch is submitted to the project’s mailing list
  5. People review the patch and eventually it is accepted upstream

On large projects, there often exists an arbitrarily long period of time in which the patch is simply left in limbo. This typically happens between steps 4 and 5. This can be due to many factors–often limited resources and/or time.

The main problem that patches face is that the longer they stay in limbo, the less likely they are to apply cleanly. And when patches carry the weight of several (often minor) unrelated changes, the effort needed to resolve a merge conflict can dramatically increase.

But never fear–when upstream has moved on and left a patch behind, there is still hope. Git can help a developer maintain a patchset until it is accepted upstream, and here’s how:

1. Rebasing

Rebasing is a powerful feature that greatly simplifies out-of-tree patch maintenance. The goal is to avoid the ‘weaving commits’ problem that linear SCMs suffer from. Consider the simple case of weaving between two developers working on unrelated features of the same project:

Instead of:

A1---B1---A2---A3---B2---A4---A5

It’s more readable to have this:

A1---A2---A3---A4---B1---B2

To make this work, typically there is some notion of “upstream.” In this example, A is upstream and B is a single developer.

   B1*---B2
  /
A1---A2---A3*---A4

Instead of leaving B’s patches in limbo, git-rebase can be used to stack the changes back on top of A’s.

                  B1*---B2
                 /
A1---A2---A3*---A4

* Represents a committed mistake that has no value

Should any conflicts arise, git-rebase will let you resolve them as you go and will even guide you through the process. Whenever it’s decided that B’s changes should ship, they will be ready to apply cleanly against the latest revision of A–without having to make sense of dozens of woven commits.
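
The restacking shown above can be tried end to end in a throwaway repository. What follows is a minimal sketch rather than a prescribed workflow: the branch name feature-b, the file names, and the identity settings are made up for illustration, and master stands in for upstream A.

```shell
# Throwaway repository so the sketch is safe to run anywhere.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the branch name used below
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"

# Upstream (A) makes its first commit.
echo "a1" > a.txt && git add a.txt && git commit -q -m "A1"

# Developer B branches off A1 and commits twice.
git checkout -q -b feature-b
echo "b1" > b.txt && git add b.txt && git commit -q -m "B1"
echo "b2" >> b.txt && git commit -q -a -m "B2"

# Meanwhile upstream moves on.
git checkout -q master
echo "a2" >> a.txt && git commit -q -a -m "A2"
echo "a3" >> a.txt && git commit -q -a -m "A3"

# Restack B's commits on top of the latest upstream commit.
git checkout -q feature-b
git rebase -q master

# Commit subjects on feature-b, oldest first.
subjects=$(git log --reverse --format=%s | xargs)
echo "$subjects"
```

After the rebase, B's two commits sit on top of everything upstream has done, which is the stacked shape from the second diagram.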

2. Squashing

An observant reader will notice we’re still weaving–just on a higher level. Commit-weaving is only a bad thing when it leads to an improper signal-to-noise ratio in the commit log. If these commits were all related, we would really have a situation more like:

A1`---B1`

This can be achieved when developers use a healthy dose of squashing. Git provides several tools for making this easy. After performing a rebase, all commits that do not exist upstream will be stacked on top. If they represent a cohesive change (and were authored by a single person), the simplest way to combine them is with git-merge, like so:

# Switch to the branch to merge into
git checkout master
git merge --squash 
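
To see what the squash merge leaves behind, here is a small sketch in a scratch repository. The branch name topic, the file app.txt, and the commit messages are placeholders chosen for the example, not names from any real project.

```shell
# Scratch repository with one commit on master.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"
echo "base" > app.txt && git add app.txt && git commit -q -m "Initial commit"

# Three noisy commits on a topic branch.
git checkout -q -b topic
echo "step 1" >> app.txt && git commit -q -a -m "wip"
echo "step 2" >> app.txt && git commit -q -a -m "oops"
echo "step 3" >> app.txt && git commit -q -a -m "trying again"

# --squash stages the combined change on master without committing it,
# so the whole topic can be recorded as one well-described commit.
git checkout -q master
git merge -q --squash topic
git commit -q -m "Add the three-step feature"

commit_count=$(git rev-list --count HEAD)
last_line=$(tail -n 1 app.txt)
echo "$commit_count commits on master; last line: $last_line"
```

The topic branch keeps its three original commits, so nothing is lost if the squashed commit later needs to be picked apart again.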

Quite frequently the commits are arranged in more interesting ways. Another useful tool is git-cherry-pick:

git checkout master
# Using the “--no-commit” flag allows for combining the changes
git cherry-pick  --no-commit
git cherry-pick  --no-commit
git commit
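
The same idea can be exercised with concrete commits. In this sketch (all file names, messages, and the topic branch are invented for the example) two related commits are picked out from around an unrelated one and recorded on master as a single commit.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"
echo "base" > notes.txt && git add notes.txt && git commit -q -m "Initial commit"

# A topic branch where the fix is interleaved with unrelated work.
git checkout -q -b topic
echo "fix" > fix.txt && git add fix.txt && git commit -q -m "Fix part 1"
echo "debug" > debug.txt && git add debug.txt && git commit -q -m "Unrelated debugging"
echo "more fix" >> fix.txt && git commit -q -a -m "Fix part 2"
part1=$(git rev-parse HEAD~2)
part2=$(git rev-parse HEAD)

# Apply only the two fix commits to the index, then commit them as one.
git checkout -q master
git cherry-pick --no-commit "$part1"
git cherry-pick --no-commit "$part2"
git commit -q -m "Fix the issue"

tracked=$(git ls-files | xargs)
echo "$tracked"
```

The unrelated debugging file never reaches master, because its commit was not among the ones picked.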

Git also has great support for doing “reverse” squashes where a single commit is split into multiple patches. Below is an example of how to split a commit that has multiple unrelated changes in the same file. This should be familiar to anyone who has struggled merging changes from team members who use conflicting indentation. This is precisely the type of noise no one concerned with releasing code should have to deal with.

# It's a good idea to do this cleanup on another branch
git checkout -b cleanup 

# Reverse the git history one commit but leave the working tree intact
git reset --mixed HEAD^

# Interactively add commit “chunks” in foo.py to the commit staging area (aka index)
git add --patch foo.py

# Commit change one
git commit -m "Fixing the gnarly production issue"

At this point, the chunks of foo.py selected above have been committed, and the commit message says what they fix. Now, pretend all the remaining chunks to be committed are related, and commit them together:

git commit -a -m "minor whitespace fixes"

The above approach is useful when changes are committed poorly and we want to clean up after-the-fact. To avoid the problem in the first place, git-stash offers a great alternative. See the documentation for more information.
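
As one possible illustration of the stash approach (a sketch under assumptions, with bar.py and the edits invented for the example): unfinished cleanup is set aside, the urgent fix is committed on its own, and the cleanup is then brought back untouched.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"
echo "print('foo')" > foo.py
echo "print('bar')" > bar.py
git add foo.py bar.py && git commit -q -m "Initial commit"

# Unfinished, unrelated cleanup in foo.py.
echo "# tidy up later" >> foo.py

# Set the cleanup aside so the fix can be committed by itself.
git stash -q
echo "print('bar fixed')" > bar.py
git commit -q -a -m "Fixing the gnarly production issue"

# Bring the cleanup back; it is still uncommitted and still separate.
git stash pop -q
pending=$(git diff --name-only)
echo "uncommitted: $pending"
```

The fix lands in history as its own commit, and the cleanup can follow later with its own message.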

3. Reordering

We have already hinted at the benefits of reordering. Rebasing is just one approach. When submitting multiple patches, having a logical order can make them easier to understand. There is a big difference between scanning 100 well-ordered (and cohesive) commits and 100 stream-of-consciousness commits. More importantly, the person reviewing the commits gets no value from commit messages like “oops” or “trying again.” If there is value in expressing a failed attempt, it should be explained in the commit log. For this next example we’re going to use git-rebase’s powerful interactive mode.

# Start the rebase from the commit before things got messy
git rebase --interactive 

From here the user will be presented with a text file. Editing the file tells git how to perform the rebase. To reorder the commits, simply reorder the lines in the text file. Removing a line will result in that commit being deleted from the history. The rest of this powerful feature is documented at the bottom of the text file. As soon as the editor is closed the rebase will begin. The beauty of it all is that whenever git cannot figure out how to merge a change, it will stop and guide the user through resolving the issue. Just remember to continue the rebase with git rebase --continue once the conflict is resolved and the fixed files are staged with git add.
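
Normally the reordering is done by hand in the editor. Purely so that the sketch below can run unattended, it sets GIT_SEQUENCE_EDITOR to a sed command that swaps the first two lines of the todo file; that scripting is an assumption of the example, not part of the workflow described above, and the file names and messages are made up.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"
echo "base" > base.txt && git add base.txt && git commit -q -m "Initial commit"
echo "docs" > docs.txt && git add docs.txt && git commit -q -m "Document feature"
echo "feat" > feature.txt && git add feature.txt && git commit -q -m "Add feature"

# Stand-in for editing by hand: hold line 1, then append it after line 2,
# so the two "pick" lines trade places in the todo file.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '1{h;d;}' -e '2G'" \
    git rebase -q -i HEAD~2

# Commit subjects, oldest first, after the reorder.
subjects=$(git log --reverse --format=%s | xargs)
echo "$subjects"
```

The two commits touch different files, so git replays them in the new order without stopping for conflicts.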

Once patches are in a logical order, it’s often convenient to post them somewhere for review. With git this could be a branch on a public repository or even a few patches posted to a mailing list. Those reviewing the code will appreciate the work put into the patch order and it could result in suggestions for a different or better ordering.
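
For the mailing-list route, git format-patch writes one numbered file per commit, in order. The sketch below assumes a topic branch based on master and writes the patches to a temporary directory; the branch, file names, and messages are invented for the example.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"
echo "base" > base.txt && git add base.txt && git commit -q -m "Initial commit"

git checkout -q -b topic
echo "feat" > feature.txt && git add feature.txt && git commit -q -m "Add feature"
echo "docs" > docs.txt && git add docs.txt && git commit -q -m "Document feature"

# One patch file per commit that is on topic but not on master.
outdir=$(mktemp -d)
git format-patch -q -o "$outdir" master
ls "$outdir"
```

The numeric prefixes preserve the commit order, so reviewers read the patches in the sequence the author intended.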

4. Documenting

With all this squashing and reordering, there exists another tool that can help developers produce quality code. This is the commit log. Many developers simply do not understand the utility of the commit log. It’s the one place that can holistically document code in a particular state. What makes git different from mainstream tools is that developers are not forced to worry about documentation at the time the code is first committed. Documenting code takes serious effort. In practice, undocumented code often gets pushed to the repository for fear of the work being lost. With git, this process can be iterative and will result in high-quality documentation over time. It should be no surprise that git-rebase is the key. If you tried the interactive rebase from the last example, simply mark the commits whose messages you would like to modify with the “edit” flag (newer versions of git also accept “reword”, which stops only to change the message).
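
Here is a sketch of that "edit" step, scripted only so it can run unattended: the sed command stands in for changing "pick" to "edit" by hand, and the file names and messages are hypothetical.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"
echo "base" > base.txt && git add base.txt && git commit -q -m "Initial commit"
echo "parser" > parser.txt && git add parser.txt && git commit -q -m "stuff"
echo "tests" > tests.txt && git add tests.txt && git commit -q -m "Add parser tests"

# Mark the first listed commit ("stuff") as "edit" instead of "pick".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '1s/^pick/edit/'" \
    git rebase -q -i HEAD~2

# The rebase is now stopped at that commit: give it a real message,
# then let git replay the rest.
git commit -q --amend -m "Add config file parser"
git rebase --continue

subjects=$(git log --reverse --format=%s)
echo "$subjects"
```

Only the message changes; the content of the commit and of the one replayed after it stays the same.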

When commits are grouped into cohesive sets of changes and are placed in a logical order, developers can better understand the code base. For example, in any average project consisting of over 100,000 lines of code there will inevitably exist a chunk of code that no one understands and everyone is afraid to touch. When production problems arise with this particular block, it’s extremely useful to be able to jump through the last 40 or 50 of its permutations (that may span several years and more than a few different developers). This is precisely why changes not applied in a logical order will be too much to comprehend as time goes on. And you never know–by looking at the changes to a fearsome block of code over time, it may become apparent that the code in question was merely a vestige of a feature that never shipped.
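
Walking back through the revisions of one file can be done by limiting git log to a path. This is a small sketch with a made-up billing.py and made-up messages; on a real project the interesting part is the same two log commands pointed at the fearsome file.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # hypothetical identity
git config user.email "dev@example.com"

printf 'def total(items):\n    return sum(items)\n' > billing.py
git add billing.py && git commit -q -m "Add total()"
echo "notes" > README && git add README && git commit -q -m "Add README"
printf 'def total(items):\n    return round(sum(items), 2)\n' > billing.py
git commit -q -a -m "Round totals to cents"

# Only the commits that touched billing.py, newest first.
git --no-pager log --format='%h %s' -- billing.py

# The same commits with the diff each one made to the file.
git --no-pager log -p --format='%h %s' -- billing.py

touching=$(git rev-list --count HEAD -- billing.py)
echo "$touching commits touched billing.py"
```

The README commit is skipped in both listings, which is what makes cohesive commits pay off: each entry that does appear is about the code in question.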

En Fin

As in all areas of software development, there are no silver bullets and teams should feel free to adapt these practices as they see fit. A typical project can reap the benefits of actively maintaining and improving commits within a few weeks. With a little practice, git can make these techniques feel less like a chore and more like adding a new coat of paint to your restored ’57 Chevy–tough work but you know it’s worth it.

7 responses to “Shipping quality code with git”

  1. Michael DeHaan says:

    Yay, git! There are clearly a lot of commands (a gazillion of them), though in practice I find myself not even using many of the ones you mention.

    A few other nice ones — “git-format-patch” and “git-send-email” are clearly outstanding tools for people who want to submit code to projects and are worth a mention. Ultimately though, you can get a long long long way with just the basics. For those that want to learn more, see the tutorial here:

    http://www.kernel.org/pub/software/scm/git/docs/tutorial.html

    Mercurial is also another similar distributed system, but having had it clobber my remote repositories on more than one occasion, I much prefer git.

    I often liken it to an Italian sports car — it’s a high performance SCM and much better than your Ford Taurus (SVN), but it does take some time to learn how to drive it properly. And it is probably easier for a newb to get wrapped around a tree. But wouldn’t you rather have the sports car?

  2. links for 2008-05-05 « Bloggitation says:

    […] Shipping quality code with git (tags: git scm programming) […]

  3. David Woodhouse says:

    It should be mentioned that these tricks (rebasing, squashing, etc.) should be played behind closed doors only. When your tree is public and other people are basing their work on it, it’s very bad to rebase or change things around.

    “Just say *no* to rebasing” — Linus Torvalds.

    “In particular, people who rebase other peoples trees should just be shot.” — Linus Torvalds.

  4. David Woodhouse says:

    Not to mention that the example shown in section 1 is particularly gratuitous — it sounds like you’ve missed the fundamental point that git doesn’t have a linear progression of commits; it’s designed to have a _graph_ of commits. You could, and _should_, have just merged the tree with ‘B’s changes, as git was designed to do.

       B1*---B2--------\
      /                 \
    A1---A2---A3*---A4---A5

    This preserves the ordering of commits and merges, and reflects the fact that changes B1 and B2 were designed, committed and tested on top of commit A1. And means that ‘git-bisect’ works as designed if we later need to track down bugs introduced in these commits.

    If developer B wants to use the ‘squashing’ trick to combine his patches into one commit, before he’s ever made them public, that might make sense. But if ‘A’ is upstream, or even an intermediate ‘subsystem’ tree, and suddenly decides that he’s going to break his tree and collapse commits ‘A1’ and ‘A2’ together so that it goes…

    A0—A2′—A3′—

    … when other developers were _using_ that tree as a base for further development, then that is really nasty.

    “Rebasing is fine for maintaining *your* own patch set, i.e. it is an alternative to using quilt. But it is absolutely not acceptable for *anything* else.” — Linus Torvalds.

  5. Brenton says:

    I probably should have mentioned this is how we maintain a large internal codebase. I completely understand (and agree) with your statements about the dangers of rewriting history. The truth is on our projects we indeed have a point in which the code is far enough downstream that no rewriting is allowed to occur.

    It all comes down to your first point of code that is “behind closed doors”. That’s the key for rewriting.

  6. Brenton says:

    Basically this article focused on shipping code. For an internal project “shipping” might mean handing off the codebase to release engineering. For an open source project shipping can simply mean publishing a repo. Indeed, once the code has shipped it must be managed differently.

    I actually had in the original draft a warning about how to apply these practices in the context of the project you are working on. In hindsight it might have been good to leave that in. I think these comments sum it up though.

    On a side note your point on the correct use of merging is well taken. I didn’t want to go into all the detail of fastforward vs. recursive merges in the article. Our problem with merging on “non-shipped” code is that on our team non-fastforward merges were happening far too often. The history was cluttered with merge commits to the point where it was extremely annoying and needless noise IMO (for a tree that hasn’t shipped yet).

  7. Prepare-se para comer poeira do git « rafa.rocha says:

    […] At the moment I have no idea how the maintainers of cvs/svn/ss/starTeam, among others, are reacting to these changes. Benchmarks[1][] are already being shown, and comparisons[1][2][3][4][5] publicly justify the gains that git has been offering. All we need to do is keep up with these developments and/or contribute on the main existing lists [1], as well as with the good tools currently under heavy development. Screenshots of the git plugin are already going around (the history views from the resource decorator, in the perspective just below in the Eclipse IDE view, are quite interesting; this will be all the rage very soon). […]