by Brenton Leanhardt
Git is a program for Source Code Management (SCM) whose complexity has been blown out of proportion. This may be due to the fact that early on it was primarily used by Linux kernel hackers who, needless to say, do not represent most users of SCM tools. Regardless of its past, today the UI is quite simple and there are only a handful of techniques a user needs to manage their code base with git–in ways that are nearly impossible to do with the mainstream alternatives. These techniques, which are mentioned in the order of their suggested usage, focus on improving the overall quality of the code base throughout the life of a project.
While developers have been managing changes to code for decades, this task is not quite as simple as it sounds. These changes, or patches, range from simple one-line fixes to features measured in hundreds of lines. However, there is one goal that all good patches share, no matter their content. The objective is to introduce positive cohesive change to the code base. In practice, this means that unrelated changes should be contained in their own patches. As projects grow there is a need to group patches into cohesive sets for both organization and maintenance.
This article is as much about using git effectively as it is about creating quality patches. Before we go any further, let’s explain the need for this coupling.
Métro, Boulot, Codo
As much as project managers (or maintainers) like to believe their releases are well planned and will go out on schedule, the reality is that changes in priorities, business needs, and staffing often disrupt the best intentions. A disciplined approach to submitting changes goes a long way toward letting projects react with agility to these kinds of challenges.
While it may not be apparent to developers in the trenches, a bit of discipline gives those in charge of integrating and releasing code the confidence to add or remove features–even if they do not have an extremely low-level understanding of the implementation. Most people tasked with releasing projects are not nearly as concerned about particular files; they are more worried about entire trees and changes over time. They want to ship feature X and yank feature Y with accuracy.
Sadly, many of the mainstream tools for managing source code are poorly set up for these kinds of tasks, and thus, they are simply ignored in the field. This can be seen in any project where a large majority of the commit messages provide vague information like “checking in my stuff”.
Let’s observe the life cycle of a typical nice-to-have patch:
1. Developer has an itch for a new feature so she downloads the project’s source
2. Several changes are made–some good, some bad
3. The feature is then tested
4. A patch is submitted to the project’s mailing list
5. People review the patch and eventually it is accepted upstream
On large projects, there often exists an arbitrarily long period of time in which the patch is simply left in limbo. This typically happens between steps 4 and 5. This can be due to many factors–often limited resources and/or time.
The main problem that patches face is that the longer they stay in limbo, the less likely they are to apply cleanly. And when patches carry the weight of several (often minor) unrelated changes, the effort needed to resolve a merge conflict can dramatically increase.
But never fear–when upstream has moved on and left a patch behind, there is still hope. Git can help a developer maintain a patchset until it is accepted upstream, and here’s how:
Rebasing is a powerful feature that greatly simplifies out-of-tree patch maintenance. The goal is to avoid the ‘weaving commits’ problem that linear SCMs suffer from. Consider the simple case of weaving between two developers working on unrelated features of the same project:
It’s more readable to have this:
To make this work, typically there is some notion of “upstream.” In this example, A is upstream and B is a single developer.
           B1*---B2
          /
    A1---A2---A3*---A4
Instead of leaving B’s patches in limbo, git-rebase can be used to stack the changes back on top of A’s.
                        B1*---B2
                       /
    A1---A2---A3*---A4

    * Represents a committed mistake that has no value
Should any conflicts arise, git-rebase will let you resolve them as you go and will even guide you through the process. Whenever it’s decided that B’s changes should ship, they will be ready to apply cleanly against the latest revision of A, without having to make sense of dozens of woven commits.
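The rebase described above can be tried in a throwaway repository. The following is a minimal sketch, not a prescribed workflow: the branch name `feature-b`, the file names, and the placeholder identity are assumptions made for illustration, and the commit subjects simply mirror the labels in the diagrams.

```shell
# Sketch: stack a developer's commits (B) on top of an upstream (A) that has moved on.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # use the branch name from the article
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

# Upstream commits A1 and A2
echo a1 > a.txt
git add a.txt
git commit -q -m "A1"
echo a2 >> a.txt
git commit -q -am "A2"

# Developer B forks here (hypothetical branch name) and commits twice
git checkout -q -b feature-b
echo b1 > b.txt
git add b.txt
git commit -q -m "B1"
echo b2 >> b.txt
git commit -q -am "B2"

# Meanwhile upstream moves on
git checkout -q master
echo a3 >> a.txt
git commit -q -am "A3"
echo a4 >> a.txt
git commit -q -am "A4"

# Replay B's commits on top of the latest upstream commit
git checkout -q feature-b
git rebase -q master

# Commit subjects, oldest first
history=$(git log --reverse --format=%s | paste -sd' ' -)
echo "$history"
```

Because the two lines of work touch different files here, the rebase finishes without stopping. When the same lines have changed on both sides, git pauses and asks for the conflict to be resolved before continuing.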
An observant reader will notice we’re still weaving–just on a higher level. Commit-weaving is only a bad thing when it leads to an improper signal-to-noise ratio in the commit log. If these commits were all related, we would really have a situation more like:
This can be achieved when developers use a healthy dose of squashing. Git provides several tools for making this easy. After performing a rebase, all commits that do not exist upstream will be stacked on top. If they represent a cohesive change (and were authored by a single person), the simplest way to combine them is with git-merge, like so:
    # Switch to the branch to merge into
    git checkout master
    git merge --squash
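To see the squash end to end, here is a small self-contained sketch. The branch name `topic`, the file `app.txt`, and the commit messages are hypothetical, chosen only to show three noisy commits collapsing into one.

```shell
# Sketch: collapse several small commits from a topic branch into one commit on master.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

echo base > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Three small, related commits on a hypothetical topic branch
git checkout -q -b topic
echo one >> app.txt
git commit -q -am "wip"
echo two >> app.txt
git commit -q -am "oops"
echo three >> app.txt
git commit -q -am "trying again"

# Switch to the branch to merge into
git checkout -q master
# --squash stages the combined change but leaves the commit to you
git merge -q --squash topic
git commit -q -m "Add feature as one cohesive change"

count=$(git rev-list --count master)
echo "$count"
```

After this runs, master holds the initial commit plus a single commit carrying all three changes, while the noisy history stays behind on `topic`.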
Quite frequently the commits are arranged in more interesting ways. Another useful tool is git-cherry-pick:
    git checkout master
    # Using the "--no-commit" flag allows for combining the changes
    git cherry-pick --no-commit
    git cherry-pick --no-commit
    git commit
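The cherry-pick commands above need commit identifiers, which depend on the repository at hand. As an illustrative sketch, the script below builds a hypothetical `scratch` branch with two related commits separated by an unrelated one, records their ids with `git rev-parse`, and combines only the related pair on master. All names and messages are assumptions.

```shell
# Sketch: combine two non-adjacent commits into a single commit with cherry-pick.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"

# A hypothetical scratch branch: two related commits with noise in between
git checkout -q -b scratch
echo parser > parser.txt
git add parser.txt
git commit -q -m "parser: first half"
first=$(git rev-parse HEAD)
echo noise > notes.txt
git add notes.txt
git commit -q -m "unrelated notes"
echo more >> parser.txt
git commit -q -am "parser: second half"
second=$(git rev-parse HEAD)

git checkout -q master
# --no-commit applies each change to the index without committing,
# so both picks end up in one commit
git cherry-pick --no-commit "$first"
git cherry-pick --no-commit "$second"
git commit -q -m "Add parser"

git log --format=%s
```

The unrelated commit is never picked, so `notes.txt` does not appear on master.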
Git also has great support for doing “reverse” squashes where a single commit is split into multiple patches. Below is an example of how to split a commit that has multiple unrelated changes in the same file. This should be familiar to anyone who has struggled merging changes from team members who use conflicting indentation. This is precisely the type of noise no one concerned with releasing code should have to deal with.
    # It's a good idea to do this cleanup on another branch
    git checkout -b cleanup
    # Reverse the git history one commit but leave the working tree intact
    git reset --mixed HEAD^
    # Interactively add commit "chunks" in foo.py to the commit staging area (aka index)
    git add --patch foo.py
    # Commit change one
    git commit -m "Fixing the gnarly production issue"
At this point, the selected changes to foo.py have been committed, and the commit message describes them. Now, pretend all the remaining chunks to be committed are related, and commit them together:
    git commit -a -m "minor whitespace fixes"
The above approach is useful when changes are committed poorly and we want to clean up after-the-fact. To avoid the problem in the first place, git-stash offers a great alternative. See the documentation for more information.
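One way git-stash can help, sketched under the assumption that a half-finished whitespace cleanup is sitting in the working tree when an urgent fix comes in: shelve the cleanup, commit the fix by itself, then bring the cleanup back and commit it separately. The file contents and messages are placeholders, and this is only one of several ways to use the command.

```shell
# Sketch: keep unrelated work out of a commit by shelving it with git stash.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

i=1
while [ "$i" -le 20 ]; do
  echo "value_$i = $i"
  i=$((i + 1))
done > foo.py
git add foo.py
git commit -q -m "Initial commit"

# Work in progress: whitespace cleanup near the end of the file
sed '19s/ = /  =  /' foo.py > foo.tmp
mv foo.tmp foo.py

# An urgent fix arrives: shelve the cleanup so it cannot leak into the fix
git stash > /dev/null
sed '2s/= 2$/= 200/' foo.py > foo.tmp
mv foo.tmp foo.py
git commit -q -am "Fix the gnarly production issue"

# Bring the cleanup back and commit it on its own
git stash pop > /dev/null
git commit -q -am "Minor whitespace fixes"

git log --format=%s
```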
We have already hinted at the benefits of reordering. Rebasing is just one approach. When submitting multiple patches, having a logical order can make them easier to understand. There is a big difference between scanning 100 well-ordered (and cohesive) commits and 100 stream-of-consciousness commits. More importantly, the person reviewing the commits gets no value from commit messages like “oops” or “trying again.” If there is value in expressing a failed attempt, it should be explained in the commit log. For this next example we’re going to use git-rebase’s powerful interactive mode.
    # Start the rebase from the commit before things got messy
    git rebase --interactive
From here the user will be presented with a text file. Editing the file tells git how to perform the rebase. To reorder the commits, simply reorder the lines in the text file. Removing a line will result in that commit being deleted from the history. The rest of this powerful feature is documented at the bottom of the text file. As soon as the editor is closed the rebase will begin. The beauty of it all is that whenever git cannot figure out how to merge a change, it will stop and guide the user through resolving the issue. Just remember to continue the rebase, once the conflict is resolved and the result is staged, with git rebase --continue.
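Interactive rebase normally opens an editor, which makes it awkward to show in a script. The sketch below stands in for the human by pointing the `GIT_SEQUENCE_EDITOR` environment variable at a tiny helper that swaps the first two lines of the instruction file. The helper script, file names, and commit messages are all hypothetical. In everyday use you would simply reorder the lines yourself.

```shell
# Sketch: reorder the last two commits with an interactive rebase, scripted for the demo.
set -e
repo=$(mktemp -d)
tools=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"
echo tests > tests.txt
git add tests.txt
git commit -q -m "Add tests for feature"
echo feature > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Stand-in for the human editor: swap the first two lines of the todo file
cat > "$tools/swap-first-two.sh" <<'EOF'
{ sed -n '2p' "$1"; sed -n '1p' "$1"; sed -n '3,$p' "$1"; } > "$1.new"
mv "$1.new" "$1"
EOF

# Start the rebase from the commit before the two we want to reorder
GIT_SEQUENCE_EDITOR="sh $tools/swap-first-two.sh" git rebase -q --interactive HEAD~2

# Commit subjects, oldest first
order=$(git log --reverse --format=%s | paste -sd'|' -)
echo "$order"
```

The two commits touch different files, so the reorder applies cleanly. Commits that depend on each other may conflict when swapped, which is where the stop-and-resolve behavior comes in.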
Once patches are in a logical order, it’s often convenient to post them somewhere for review. With git this could be a branch on a public repository or even a few patches posted to a mailing list. Those reviewing the code will appreciate the work put into the patch order and it could result in suggestions for a different or better ordering.
With all this squashing and reordering, there exists another tool that can help developers produce quality code. This is the commit log. Many developers simply do not understand the utility of the commit log. It’s the one place that can holistically document code in a particular state. What makes git different from mainstream tools is that developers are not forced to worry about documentation at the time the code is first committed. Documenting code takes serious effort. In practice, undocumented code often gets pushed to the repository for fear of the work being lost. With git, this process can be iterative and will result in high-quality documentation over time. It should be no surprise that git-rebase is the key. If you tried the interactive rebase from the last example, simply mark the commits whose messages you would like to modify with the “edit” flag.
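Here is a sketch of the “edit” flow in a sandbox. As before, a small hypothetical helper plays the role of the editor by changing the first `pick` to `edit`, and the commit messages are made up for the example. When the rebase stops, the message is rewritten with `git commit --amend` and the rebase is resumed.

```shell
# Sketch: rewrite an old commit message by marking the commit "edit" during a rebase.
set -e
repo=$(mktemp -d)
tools=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"
echo fix > fix.txt
git add fix.txt
git commit -q -m "oops"
echo docs > docs.txt
git add docs.txt
git commit -q -m "Document the fix"

# Stand-in for the human editor: mark the first listed commit with "edit"
cat > "$tools/mark-edit.sh" <<'EOF'
sed '1s/^pick/edit/' "$1" > "$1.new"
mv "$1.new" "$1"
EOF

GIT_SEQUENCE_EDITOR="sh $tools/mark-edit.sh" git rebase -q --interactive HEAD~2

# The rebase is now paused on the "oops" commit: give it a useful message
git commit -q --amend -m "Handle empty input in the parser"
GIT_EDITOR=true git rebase --continue > /dev/null 2>&1

# Commit subjects, oldest first
subjects=$(git log --reverse --format=%s | paste -sd'|' -)
echo "$subjects"
```

If only the message needs to change, the instruction file also accepts `reword`, which opens the message in an editor without pausing the rebase.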
When commits are grouped into cohesive sets of changes and are placed in a logical order, developers can better understand the code base. For example, in any average project consisting of over 100,000 lines of code there will inevitably exist a chunk of code that no one understands and everyone is afraid to touch. When production problems arise with this particular block, it’s extremely useful to be able to jump through the last 40 or 50 of its permutations (that may span several years and more than a few different developers). This is precisely why changes not applied in a logical order will be too much to comprehend as time goes on. And you never know–by looking at the changes to a fearsome block of code over time, it may become apparent that the code in question was merely a vestige of a feature that never shipped.
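One way to walk through the history of a particular block is the `-L` option of `git log`, which follows a line range in a file through past commits. The sketch below is illustrative only: the file, the function names, and the line range `1,2` are assumptions, and the commit prefix in `--format` is there just to make the matching commits easy to pick out.

```shell
# Sketch: list only the commits that touched one block of a file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"        # placeholder identity for the sandbox
git config user.email "dev@example.com"

printf '%s\n' 'def fearsome(x):' '    return x + 1' '' \
              'def helper(x):' '    return x' > foo.py
git add foo.py
git commit -q -m "Add fearsome and helper"

# Change inside the block of interest (line 2)
sed '2s/x + 1/x + 10/' foo.py > foo.tmp
mv foo.tmp foo.py
git commit -q -am "Adjust fearsome offset"

# Change elsewhere in the file (line 5)
sed '5s/return x/return x * 2/' foo.py > foo.tmp
mv foo.tmp foo.py
git commit -q -am "Double the helper result"

# Another change inside the block of interest
sed '2s/x + 10/x + 100/' foo.py > foo.tmp
mv foo.tmp foo.py
git commit -q -am "Adjust fearsome offset again"

# Follow lines 1-2 of foo.py through history and keep just the subject lines
touched=$(git log -L 1,2:foo.py --format='commit: %s' | grep '^commit: ')
echo "$touched"
```

Without the `grep`, the same command also prints the diff of the block at each step, which is the part that pays off when every commit is cohesive and well described.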
As in all areas of software development, there are no silver bullets and teams should feel free to adapt these practices as they see fit. A typical project can reap the benefits of actively maintaining and improving commits within a few weeks. With a little practice, git can make these techniques feel less like a chore and more like adding a new coat of paint to your restored ’57 Chevy–tough work but you know it’s worth it.