james mckay dot net

because there are few things that are less logical than business logic
26
Mar

What is Git’s market share?

Git versus Mercurial arguments annoy me.

They annoy me because they’re fighting the wrong battle. Git fanatics who say that “Git has won” are so intent on killing off Mercurial that they’ve completely lost the plot with the issue that really matters. It’s old-school, inefficient, restrictive, trunk-based tools like Subversion and TFS that are the problem, not Mercurial.

People who say that “Git has won” point to the success of GitHub. While this is impressive, it doesn’t give the whole picture: a huge proportion of the industry is still stuck with Subversion, and the majority of corporate developers view the GitHub crowd as a bunch of arrogant prima donnas who believe that passion==competence and who think that they’re high-end developers simply because they blog, use Twitter, and know Ruby on Rails. Uncle Bob Martin is particularly scathing about people like that. GitHub is also dominated by developer tools and libraries, and as far as I can tell it seems to be significantly less popular among authors of userland software.

Unfortunately, sorting out the facts from the hype isn’t easy. Version control surveys seem to be a bit thin on the ground, and usually have inbuilt biases that skew the picture somewhat. The most reliable ones would probably come from a company such as Gartner or Forrester Research, but I’ve found these a bit hard to pin down too. The most recent one that I could find was this survey from Dr. Dobb’s/Forrester Research (hat tip: David Richards of WANdisco):

Dr. Dobb’s/Forrester Research SCM survey results 2009

I see no reason to doubt these figures, though they are about three years old now and I haven’t been able to find a more recent repeat of the same survey.

Aside from that, the best I can come up with is the annual Eclipse Community Survey, which is conducted every April. Eclipse is an IDE that tends to be widely used in enterprise settings, primarily among Java developers, so it’s probably the best fit for what I’m looking for. While it largely filters out the loud Ruby on Rails type fanaticism, it unfortunately also largely ignores the .NET world, which can be infuriatingly conservative at times. However, it does paint a picture in broad brush strokes that gives some indication of how things have been changing since the 2009 survey.

Their figures are as follows:

Year    Git      Mercurial    Subversion
2009     2.4%     1.1%        57.5%
2010     6.8%     3.0%        58.3%
2011    12.8%     4.6%        51.3%

Some observations here:

  • The 2011 survey put Git in third place, just behind CVS (!) in second place with 13.3%. Git’s share has grown fivefold in two years, which makes it increasingly hard to argue that Git hasn’t yet “crossed the chasm.” The claim that “Git has won,” however, is quite clearly premature, given that Subversion users still outnumber Git users four to one.
  • Mercurial, coming fourth equal alongside Perforce, has a larger market share than I expected given the demographic: I was under the impression that outside of the .NET ecosystem, it was pretty much a lost cause these days. If you were to factor in .NET developers, its mindshare relative to Git would probably be somewhat higher, since many .NET developers are still dissatisfied with Git’s Windows support and usability story. Certainly, if you’re happy with Mercurial and don’t need to contribute to projects on GitHub, there’s no need to switch to Git on the basis of mindshare alone at this stage.
  • Subversion is starting to lose market share, and I expect this trend to continue, if not accelerate, over the next year or two. You should therefore be seriously evaluating a distributed option for new projects sooner rather than later, or you risk being left behind. However, it’s too early to complain about existing projects still using it, especially if they are surrounded by a lot of process and infrastructure that makes migration difficult.
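For anyone who wants to check the arithmetic behind “fivefold” and “four to one”, here is a minimal sketch using only the survey figures from the table above. The variable names are mine, purely for illustration.

```python
# Eclipse Community Survey shares quoted in the table above (percent).
shares = {
    2009: {"Git": 2.4, "Mercurial": 1.1, "Subversion": 57.5},
    2011: {"Git": 12.8, "Mercurial": 4.6, "Subversion": 51.3},
}

# How many times larger Git's share was in 2011 than in 2009.
git_growth = shares[2011]["Git"] / shares[2009]["Git"]

# How many Subversion users there were per Git user in 2011.
svn_to_git = shares[2011]["Subversion"] / shares[2011]["Git"]

print(f"Git growth 2009-2011: {git_growth:.1f}x")
print(f"Subversion users per Git user in 2011: {svn_to_git:.1f}")
```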

It’ll be interesting to see what the 2012 survey reveals, but extrapolating these figures would suggest that current market shares are probably somewhere around 18-20% for Git, 6-7% for Mercurial, and 40-45% for Subversion. This would put Git on course to overtake Subversion for the number one slot sometime towards the end of next year.
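One naive way to arrive at estimates in those ranges is to assume that each tool’s most recent year-on-year change simply repeats for another year. This is only a sketch of that assumption, not necessarily how the figures above were derived; `carry_forward` is a hypothetical helper name.

```python
# Eclipse Community Survey shares for 2009, 2010 and 2011 (percent).
SURVEY = {
    "Git": [2.4, 6.8, 12.8],
    "Mercurial": [1.1, 3.0, 4.6],
    "Subversion": [57.5, 58.3, 51.3],
}

def carry_forward(series):
    """Estimate the next value by repeating the latest year-on-year change."""
    return series[-1] + (series[-1] - series[-2])

estimates_2012 = {
    tool: round(carry_forward(values), 1) for tool, values in SURVEY.items()
}

for tool, estimate in estimates_2012.items():
    print(f"{tool}: roughly {estimate}%")
```

Under that assumption each estimate falls inside the corresponding range quoted above. A trend that is accelerating, as Git’s appears to be, would push the Git figure higher.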

13
Jul

Feature branches versus continuous integration

I love asking questions that get people to question their practices and assumptions. Especially when I get the impression that a lot of these practices have been adopted unthinkingly by a lot of people. That is why, in the conversation on Twitter surrounding my critique of Martin Fowler’s position on feature branches, I posed the following two questions:

  1. If DVCS had come first, would Continuous Integration ever have been invented?
  2. What exactly do people understand Continuous Integration to be anyway?

In answer to the second question, you could of course point me to Martin Fowler’s excellent essay on the subject, but I just want to make sure that you’ve read and understood it yourself. For it contains one point that most people seem to overlook when discussing CI, and it is this:

Everyone integrates their work into the main line every day.

For Subversion, this would be trunk. For Git or Mercurial, it would be a single master repository that does not permit multiple heads.

Clearly, most teams do not do this. While there are many tasks that you can break down into functional tasks of less than a day, there are many more that you can’t. Consequently, if you wish to adopt this approach, you will be checking incomplete work potentially into production, hence the need for feature toggles.

It was this, specifically, that I described as a workaround for poor branching and merging support.

Here’s a little thought experiment. Martin Fowler wrote that article in 2006, at a time when few people had even heard of distributed source control. Just suppose that Git had come to prominence ten years earlier and that the DVCS workflows with frequent branching and merging were standard practice industry-wide. Would he have mandated pushing your changes to the mainline on a daily basis then?

Somehow, I don’t think so, and even if he had, it would have gained little or no attention. It would be widely regarded as a solution looking for a problem.

For what it’s worth, long lived branches are a bit of a straw man here. If you are following an agile methodology such as Scrum, your features will naturally be fairly limited in scope anyway so you shouldn’t ever reach that point. Nor am I convinced by the assertion that your mainline should act as a point of communication. That’s what your daily stand-up meetings, sprint retrospectives and face to face communication throughout the day are for.

Of course, there are some situations where a daily integration approach will be beneficial, or even necessary. If you are stuck with an old-school centralised source control system such as Subversion or TFS, for instance, you don’t have much choice. While it is possible in principle to adopt a branch per feature approach with these tools, in practice they enforce so much ceremony and risk around creating, merging, and switching between branches that it simply isn’t practical.

Another scenario is where your team is new to branching and merging. Some novice DVCS users start off with a lot of magical thinking and unrealistic expectations about branching, dive straight in at the deep end with long lived branches and indiscriminate refactoring, then wonder why things go pear shaped when they attempt to merge. My general recommendation is to start with a CI-style approach and gradually expand into small and then larger feature branches as your experience and maturity as a team grows.

Third, some tasks are better suited to feature branches than others. Generally, the more dependencies your code has, the more frequently you will need to integrate. Changes to your data layer or shared classes will need to be integrated pretty much as soon as they’re done, especially if you aren’t following the Open/Closed Principle, whereas on the other hand, changes that largely incorporate new functionality, or that only affect your UI layer, will work fine even on long lived feature branches. Knowing where the limitations and possibilities lie, and how to mitigate them, allows you to adopt a much more flexible approach to feature branches, one that you can adapt to the maturity and expertise of the team and the needs of your project.

Finally, it is worth noting that feature branches are only incompatible with this one specific aspect of Continuous Integration. Other aspects of CI, and of course pretty much everything about Continuous Delivery, are completely orthogonal concerns, and there is absolutely no reason why you should not do both.

07
Jul

Why does Martin Fowler not understand feature branches?

Update: this subject seems to have generated quite a lively discussion in the blogosphere and on Twitter. If you want to see what people are saying elsewhere and follow the discussion further, here are some links:

Make sure you read the comments on these posts to get both sides of the story.


In the software development world in general, and in the agile community in particular, Martin Fowler is pretty much considered the ultimate authority on just about everything. He’s not like Joel Spolsky, who preaches a feel-good message laced with linkbait such as you should invent your own programming language but not use Ruby, or that the SOLID principles are too bureaucratic to be useful. He speaks in measured tones based on years of experience working with large enterprise clients. He (mostly) shuns controversy and focuses on value. He adopts a Neutral Point Of View. He is not just an agile coach, or an agile luminary, he is one of the authors of the Agile Manifesto itself. He has written several ground-breaking books on software engineering, including Refactoring and Patterns of Enterprise Application Architecture. Someone of that stature makes the best of us look like schoolboys in short trousers by comparison.

So when I came across this video of him and his colleague Mike Mason speaking on the perils of feature branches, I was absolutely horrified. Horrified, because much of what they have to say is totally misleading and completely misrepresents what branch-per-feature is all about. It sounds like the kind of anti-branch FUD that I only ever expected to hear from Highly Paid TFS Consultants and vendors of Subversion-based software, who have the most to lose from the next generation of source control tools that make branching and merging safe, robust and easy.

Note first of all what they’re saying.

  • They are saying that Feature Branching is incompatible with Continuous Integration.
  • They are not saying that you can shoot yourself in the foot if you do Feature Branching wrong, but you can be much more productive if you do Feature Branching right. They are saying that Feature Branching is bad. Period.
  • They are not saying that some tools aren’t cut out for Feature Branching but others are, and if you’re stuck with one of the ones that aren’t, here’s how to make the most of it. They’re saying that Feature Branching is bad. Period.

Now they may argue that they don’t actually mean that, or that I’ve misunderstood them, but the problem is that that is what the anti-DVCS scaremongers will hear, and as sure as eggs are eggs, they will milk it for all it’s worth. Certainly, the whole tone of the video towards feature branches is strongly negative. And in some places, factually inaccurate. Besides which, the alternative that they propose is no better.

Feature branches are not about sporadic integration.

Fowler contrasts feature branching with Continuous Integration as if the two are diametrically opposed to each other. He says this:

The cool rule of thumb that I use for Continuous Integration is that everybody’s integrating back to the mainline at least once a day, while with feature branching, people can go for many days, possibly many weeks, until they complete the feature, and then they only integrate once the feature is done. And it’s that frequency of integration which is the key difference.

No, no, no, no, no, no, no! That is not what feature branches are about!

Feature branches are not about increasing the amount of time between integration, they are about decreasing the time between check-ins. You still break down your features into the smallest tasks you can deliver. You still integrate as frequently as you can, yes, at least once a day where possible. The difference with feature branches is that you start checking in code several times an hour. Why? First, this gives you checkpoints that you can roll back to if you discover that you’ve made a mistake, and second, it makes the thinking behind what you’re doing easier to describe and decipher.

The reason people think feature branches are for long-running epics is that this is the way that old school centralised tools have taught them to think. Those tools have made branching and merging hard through poor tooling, algorithm support and architecture, so people do everything in trunk except for epics, then they run into problems and the whole pattern becomes self-reinforcing. Whereas the correct way to do it is to make branching and merging the default option, so people start off with small, frequent merges, and naturally expand into epics as they develop the maturity to handle them.

Feature branches versus Continuous Integration is a false dichotomy. You don’t do either/or, you do both/and.

Semantic conflicts are not specific to branching and merging.

Fowler then goes on to bring up the issue of semantic conflicts.

And the issue here is, yes they can make the problems of textual merging go away, and certainly the textual merging algorithms have improved greatly over the last ten or twenty years, but there’s still semantic issues that go deeper than the pure text, and they, those do not go away, and as a result you still have big, painful merges and the pain increases exponentially the longer you leave between integrations. So by integrating more frequently, even though you’re doing a lot more integrating, the integrations are sufficiently small that they’re not actually causing a lot of problems and as a result you can do them a lot more smoothly as you go along.

I’m sorry, but that is a straw man. Yes, semantic conflicts are a problem, but they are not specific to branching and merging. Any kind of integration will have semantic conflicts. Trunk-based development will have semantic conflicts. A lock/edit/unlock model such as that adopted by Visual SourceSafe will have semantic conflicts. Everyone editing the same files directly on the same server and not using source control at all will have semantic conflicts. Semantic conflicts are a fact of life no matter what integration model you use. Get used to it.

Besides, the integrations where you need to worry most about semantic conflicts are not the biggest, scariest ones. By the time you get to that level, you are looking at textual conflicts all over the place anyway, so you’ll be on your guard and double checking everything. The semantic conflicts you need to worry about are the ones where there are no textual conflicts at all, where you have dropped your guard. In other words, the size of merges that are typical of Continuous Integration. In any case, we have tools to deal with them. They are called compilers, static analysis, and unit tests. As for any that slip through that net, all you can do is treat them like any other bugs. The question here is, how many of these semantic conflicts translate into new bugs and regressions in your code compared to bugs introduced through any other means? When you are used to feature branching, that is exactly how you view semantic conflicts: as no greater or less a deal than any other form of bug.
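To make that concrete, here is a minimal sketch of a merge that is textually clean but semantically broken, and of a test catching it. Everything in it is hypothetical: one branch is assumed to have renamed `total()` to `order_total()`, while the other added a new caller of the old name on a different line.

```python
# Result of a merge with no textual conflicts: the two branches edited
# different lines. One renamed total() to order_total(); the other added
# print_receipt(), which still calls the old name.
MERGED_SOURCE = '''
def order_total(prices):
    return sum(prices)

def print_receipt(prices):
    return "Total: %d" % total(prices)
'''

def smoke_test(source):
    """Load the merged code and exercise the newly added function."""
    namespace = {}
    exec(source, namespace)
    try:
        namespace["print_receipt"]([1, 2, 3])
    except NameError as error:
        return "semantic conflict caught: %s" % error
    return "ok"

print(smoke_test(MERGED_SOURCE))
```

In a compiled language the compiler would flag the stale call before any test ran. In a dynamic language such as Python, it takes a test that actually executes the new code path.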

For what it’s worth, semantic conflicts have been the subject of some academic research, for example, in this paper, which is well worth a read for anyone who is concerned about them, their prevalence, the risks that they may or may not pose, and strategies to mitigate their effects. But it would be naive and wrong to be paranoid about feature branches because of them, because, as I said, they can crop up just as easily in trunk-based development.

Feature toggles can cause problems too. Bigger, scarier problems.

Feature toggles are a way of avoiding branching and merging, which Joel Spolsky describes as “a workaround for the fact that your version control tool is not doing what it’s meant to do.” Branch by Abstraction is much the same thing, except that it uses IoC containers instead of if statements. Mike Mason explains it as follows:

So, feature toggles are a mechanism whereby you implement a new feature, but simply have it switched off, usually in the software configuration, and in most systems that can be as simple as not including a feature in the UI, so it’s actually being developed under the covers, it’s present in the source code, but you don’t enable it in the UI so nobody actually sees it.

You see what the problem is here? And man, it is a massive problem—visible or not, you are still deploying code into production that you know for a fact to be buggy, untested, incomplete, and quite possibly incompatible with your live data. Your if statements and configuration settings are themselves code which is subject to bugs—and furthermore can only be tested properly in production. They are also a lot of effort to maintain, making it all too easy to fat-finger something. Accidental exposure is a massive risk that could all too easily result in security vulnerabilities, data corruption or loss of trade secrets. Your features may not be as isolated from each other as you thought they were, and you may end up deploying bugs to your production environment. I’m not speaking theoretically here either: this has actually happened to me. Unless your feature toggles are well bedded in, and you have some pretty mature processes and infrastructure in place to manage them, you’re asking for trouble. Certainly, my own experience of them has been uniformly negative.
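As an illustration of how easy it is to fat-finger a toggle, here is a minimal sketch. The setting name and both helper functions are hypothetical; it assumes configuration values arrive as strings, as they typically do from environment variables or .ini files.

```python
# Hypothetical settings as they might be read from an environment file:
# every value is a string.
settings = {"NEW_CHECKOUT_ENABLED": "false"}

def naive_toggle(name):
    # Bug: any non-empty string is truthy, so "false" switches the feature ON.
    return bool(settings.get(name))

def strict_toggle(name):
    # Safer: only an explicit "true" (in any letter case) enables the feature.
    return settings.get(name, "").strip().lower() == "true"

print("naive toggle says enabled: ", naive_toggle("NEW_CHECKOUT_ENABLED"))
print("strict toggle says enabled:", strict_toggle("NEW_CHECKOUT_ENABLED"))
```

The naive version exposes the unfinished feature even though the configuration says “false”, which is exactly the kind of accidental exposure described above.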

But wait! He actually goes on to acknowledge these concerns, albeit in drastically watered down and sugar-coated terms compared to the sheer FUD that gets spouted against feature branches:

The traditional concern with that approach is that you’ve got a half finished feature in your codebase, so OK it’s not visible in the UI, but you made some changes, surely you may have broken something? And the right response to that is to have confidence in your automated testing that you should also have, to guarantee that even with a half-finished feature that’s hidden with a feature toggle, you haven’t broken anything else.

Right. So, after having given us no guidance whatsoever on how we can use feature branches responsibly and effectively, they then go on to give us guidance on how to responsibly use what is, when it boils right down to it, a high-risk, high-maintenance hack. If that isn’t a lopsided view of the subject, then what is?

In fact, they go on to talk about one case where the client had so many feature toggles that they needed to have their own custom compilation of the Linux kernel just to handle them all. But oddly enough, they don’t translate that into “feature toggles considered harmful” in the way they do for feature branches.

For what it’s worth, this highlights a problem with the very idea of Continuous Integration itself.

Continuous Integration is largely a workaround anyway.

It seems that all the FUD about feature branches boils down to one thing: we should restrict ourselves to trunk-based development because Continuous Integration is the One True Way to do configuration management. But let’s just take a step back and ask ourselves why we do Continuous Integration in the first place. Largely because we were restricted to trunk-only development. If you check in code that breaks the build, then go home, and then someone else checks it out, they can’t get anything done till you return and fix it. You constantly need to have a ready-to-deploy version of your code in case of security vulnerabilities. While this isn’t the whole picture, and there are other arguments in favour of Continuous Integration, it is at least partly a hack to work around the restrictions of trunk-based development. The vagueness of Martin Fowler’s “cool rule of thumb”, that you should check in “at least once a day”, is testimony to this. Cargo culting your way through advice like that will lead to you checking in incomplete, buggy or even downright broken code, and the need for high-maintenance hacks such as feature toggles to compensate.

A hack to work around the limitations of another workaround for the limitations of our tooling. All this being lauded as a best practice. What on earth is the world coming to?

But these restrictions don’t apply to DAG-based tooling, where you can easily base your work off the last known good revision. Yes, you can end up with conflicts when you branch and merge, and yes, you can easily shoot yourselves in the foot if you’re not careful, and yes, you can end up with Big Scary Merges, but with good communication and planning among developers, many teams can—and do—easily and confidently handle surprisingly long-running branches with few if any problems. And yes, you should be keeping your work items as small as possible so that you can integrate as frequently as possible. Yes, feature branches get misused. But the correct response to misuse is not disuse, but proper use.

The most dangerous factor in this matter, however, is Martin Fowler’s reputation. Many agilists view him as infallible when speaking ex cathedra on matters of architecture and methodology, and for him to promote points of view such as these is to hand ammunition to the anti-DVCS lobby on a plate. Were this to lead to a wholesale rejection of feature branching and distributed source control, it would set the entire agile movement back several years, and that would be very sad.

What is really needed is a holistic, unbiased review of both feature branches and Continuous Integration, to see how we can use feature branches responsibly, and at the same time how Continuous Integration and Continuous Delivery can best be adapted to take them into account. But simply insisting on hanging onto the established ways of doing things with complete disregard for new and better technology is cargo cult programming, and it certainly isn’t agile.

15
Jun

No, WANdisco, Git does NOT promote anti-social development

David Richards of WANdisco has this to say about Git:

Funny, I was talking about this only today with an industry analyst and he has the same conclusion that we have. Git has its uses but probably not in the enterprise. OK please listen, I know that statement will upset a bunch of senior developers who think that GIT solves everything but it really doesn’t.

If you think about it GIT actually promotes anti-social software development; development in small, disconnected silos is not how software is developed in the real world. Most software is developed by teams whose members have a variety of skills who need to see what each other is doing and that’s the fundamental reason why GIT is not a threat to Subversion in the enterprise. It’s fine for the development of the Linux kernel but that model doesn’t work for most companies.

I’m sorry, David, but that is just wrong. It’s not just wrong in an existential sense, this isn’t just a Kantian viewpoint where you really need to consider what Schiller had to say on the subject, it’s wrong in the sense that 1+1=3 is wrong, or that the earth is flat is wrong, or that astrology is wrong. In other words, it’s just plain wrong. It is complete and utter FUD, it is a total misunderstanding of what easy, fluent branching and merging actually gives you, and it is totally detached from reality.

This argument, like every other anti-DVCS argument that you hear from Blub programmers and people who are being paid money to promote the likes of Subversion and TFS, boils down to, “These are the problems that you might encounter with DVCS if you don’t use it properly.” Compare that to the point that we DVCS guys make about centralised tools: “These are the problems that you are encountering because you can’t use it properly.” That is why Git and Mercurial are so popular these days among developers. They aren’t just a passing fad — the traditional, line-based, centralised model simply doesn’t work.

Oh, and by the way, there are plenty of success stories of DVCS in the enterprise.

The fear that’s being promoted here is that people will use branches as an alternative to communication. In practice, that simply doesn’t happen with Git and Mercurial users. While there’s always a possibility that you’ll end up with wildly divergent branches that give you a Big Scary Merge, this is a problem that can easily be avoided and is in fact generally self-correcting in the long run. Once you’ve got a feel for the limitations to what you can achieve with branching and merging, you tend to work within, and adapt to, those limitations. It’s called “learning from your mistakes,” which is an essential part of agile software development (and if you’re not doing some flavour of agile, you’re doing it wrong).

On the other hand, centralised source control, with its restrictive line-based model, presents you with problems that are difficult if not impossible to fix. It forces you to make compromises in how often you check in code for starters. Siloed, anti-social development happens just as much in Subversion as it does in Git, and with far worse consequences, since the equivalent in Subversion is delaying checking in anything at all till your whole work item is complete. When this happens, you run the risk of losing hours if not days of work — you end up with so many conflicts when you run svn update that you get confused and have to start over from scratch.

Before you start making a judgment about distributed source control, please do us all a favour. First, actually take the time to learn what you’re talking about. If you can’t explain to me what git bisect does, or what the tangled working copy problem is and why it’s a problem, or what a DAG is, or what the difference is between merge and rebase, then you’re not qualified to dismiss DVCS as unsuitable. It’s as simple as that. Then, actually try it out in practice on a non-trivial team project. Because again, if you are just going off hearsay rather than experience, you are simply not qualified to dismiss it as unsuitable.

04
Apr

Why merges can (and should) be automated

Long time Mercurial users will no doubt appreciate the new merge and conflict resolution dialogs in TortoiseHg 2.0. When you have some conflicting files, rather than making you go through them one at a time with no idea how many more there are to handle, you are given a list of them with options to help you merge them.

TortoiseHg Conflict resolution dialog

However, there is one feature of this dialog that will no doubt raise an eyebrow or two. Whenever a file has been modified on both sides of the merge, it reports it as a conflict, even if the modifications were to completely different parts of the file. What is going on here? Has Mercurial suddenly forgotten how to merge? Is it turning into Team Foundation Server? Whatever next, read-only files and baseless merges?

Actually, no it hasn’t. That was my reaction when I first tried out the development version of TortoiseHg 2.0 last summer, so I rolled up my sleeves and coded up an option to restore the traditional behaviour:

TortoiseHg options - Autoresolve merges

When you merge, you can also choose on a case by case basis between automatic and manual file resolution:

TortoiseHg merge wizard - autoresolve merges

So why does it work this way now? In my discussions on the TortoiseHg mailing list, Steve Borho, the lead developer of TortoiseHg, pointed out that there’s a lot of hallway usability testing behind it:

I’ll allow that long-time Mercurial users may find this limiting, so I’ve assumed that we’ll eventually add a back door to revert to default Mercurial behavior.  But I have heard from many new users over the years that this is the one part of the Mercurial interface that is unsettling, having kdiff3 thrown at them at seemingly random occasions, so I want the internal:fail approach to be the initial default.

André Sintzoff concurred:

I agree with you. Most of the new users I know are somehow disturbed by the “old” merge behaviour.

When I show them the “new” behaviour, they are enthusiast.

They had a valid point. This is something I’d forgotten about myself.

Inexperienced developers are usually terrified of merging. When you’re combining two people’s changes together, you need to know and understand not only the changes themselves, but the context as well. To delegate the entire process to some unknown computer algorithms sounds reckless and dangerous. This was one of the first things I found intimidating about svn update when I started using source control in a team context.

Yet in practice, fully automated merging works remarkably well. When you run svn update or hg merge, more often than not, it all goes very smoothly — in fact, much more so than attempting to merge everything manually. Why should this be?

1. In the overwhelming majority of cases, the default option is the correct one.

Next time you do a merge, turn off automatic conflict resolution and use a three-way tool such as Perforce Merge. I particularly like Perforce Merge because it shows you exactly what’s going on. At the top, you have the two sides of the merge on either side of the original version, so you can tell whether something was added on the left hand side or whether it was deleted on the right hand side:

Merge resolution example

In the most basic case, automated merge tools assume that if a change was made to one side of the merge, but there is no corresponding change on the other side, that change should be included in the final result. That’s what shows up in the bottom pane. On the other hand, if two people have edited the same part of the file, it shows up as a merge conflict and you have to resolve it manually.
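The rule just described can be sketched in a few lines. This is a deliberately simplified illustration, not how Mercurial, Git or Perforce Merge actually implement merging: it assumes the three versions have the same number of lines and are already aligned one-to-one, whereas real tools first have to match up inserted and deleted lines. The function name `merge3` is mine.

```python
def merge3(base, ours, theirs):
    """Three-way merge of line lists that are already aligned one-to-one.

    Returns the merged lines plus the 1-based numbers of conflicting lines.
    A conflicting line is left as None for a human to resolve.
    """
    merged, conflicts = [], []
    lines = zip(base, ours, theirs)
    for number, (base_line, our_line, their_line) in enumerate(lines, start=1):
        if our_line == their_line:
            merged.append(our_line)      # same on both sides (or untouched)
        elif our_line == base_line:
            merged.append(their_line)    # only their side changed it
        elif their_line == base_line:
            merged.append(our_line)      # only our side changed it
        else:
            merged.append(None)          # both changed it differently
            conflicts.append(number)
    return merged, conflicts

base = ["import os", "timeout = 30", "retries = 3"]
ours = ["import os", "timeout = 60", "retries = 3"]
theirs = ["import os", "timeout = 30", "retries = 5"]

clean_merge, clean_conflicts = merge3(base, ours, theirs)
print(clean_merge, clean_conflicts)

# Both sides edit line 2 differently, so it needs manual resolution.
clash_merge, clash_conflicts = merge3(base, ours, ["import os", "timeout = 90", "retries = 3"])
print(clash_merge, clash_conflicts)
```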

Now here’s the key. Once you’ve carried out a few manual merges, you soon realise that with non-conflicting text differences, you almost never choose anything other than this default option. It becomes evident that working your way manually through a string of differences where you only ever choose the default is largely a waste of time.

2. Manual merges increase the risk of human error.

Having said that, automated merges don’t always get it right, and you do sometimes need to be aware of the context on each side. But — and it is a big but — manual merges fare no better.

Here’s a simple example where both automatic and manual merges are liable to give the wrong result. Let’s say that two developers, Alice and Bob, both make an identical change to a source file on their respective branches — for example, throwing an exception when something can’t be found. Then, Bob commits a subsequent change which backs it out. Should the new code be included in the merge or not?

     Changed ------
    /              \
Original         ???????
    \              /
     Changed - Original

Mercurial and Git both take the line that because Bob undid the change, it should be as if he had never made it in the first place (a feature called “implicit undo”), and therefore Alice’s change should be included in the merge. But that is not necessarily what you want. An ideal version control tool would report this as a conflict, but what happens then?

Here’s what it might look like in your merge tool:

Merge resolution with implicit undo

There is no indication whatsoever that as well as being added on the left hand side, that exception was also added on the right hand side and then deleted again. Because your manual merge is a naive three-way merge, with no awareness of history, it also gives you implicit undo, and unless you are particularly on the ball and aware that this change was made then undone in the first place, you won’t pick up on it.
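To see why a naive three-way merge cannot notice the undo, here is a minimal sketch in Python. It assumes each branch is just a list of successive versions of one line; the exception name, the `three_way_merge` function and the `changed_then_reverted` helper are all hypothetical, invented for illustration.

```python
def three_way_merge(base, left_tip, right_tip):
    """Naive three-way merge: it only ever sees the base and the two tips."""
    if left_tip == right_tip:
        return left_tip
    if left_tip == base:
        return right_tip
    if right_tip == base:
        return left_tip
    raise ValueError("conflict")


def changed_then_reverted(base, history):
    """True if a branch moved away from the base and then came back to it."""
    return (bool(history)
            and history[-1] == base
            and any(state != base for state in history))


base = "return null;"
alice = ["throw new NotFoundException();"]                 # change kept
bob = ["throw new NotFoundException();", "return null;"]   # change, then undo

# The merge looks only at the last state of each branch, so Bob's branch
# is indistinguishable from one that was never touched.
merged = three_way_merge(base, alice[-1], bob[-1])
print(merged)

# Only a history-aware check can tell that Bob deliberately backed it out.
print(changed_then_reverted(base, bob))
```

The merge silently takes Alice's version, whether it is run by a tool or by you in a three-pane window, because the intermediate state on Bob's side never reaches it.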

But if you’ve just worked your way through a dozen or more diffs where you’ve chosen the default option every time, the chances are that your eyes will be glazing over, you won’t be on the ball, and you’ll miss it. And therein lies the rub: as well as being slow, manual merge resolution increases the risk of human error.

Another problem with manual merge resolution is that it frequently presents you with diffs that are pretty confusing and overwhelming. Visual Studio .sln files are a particular pain to work with in this respect, since you are dealing with lines and lines of GUIDs that blur into each other. Very often, the only difference between the two sides is that stuff has been moved around. In cases such as these, it can be almost impossible to carry out a manual merge effectively, whereas an automated merge will work out fine. Long lines just compound the problem.

So there’s your trade-off. An automated merge, which may or may not be correct due to ignorance of context. Or a manual merge, which may or may not be correct due to human error and lack of clarity of both context and content. And is several orders of magnitude slower into the bargain.

3. Semantic resolution is more easily dealt with by compiling and testing anyway.

The upshot of this is that merging is actually a two-pass process, regardless of how you do it. The mechanical operation of combining your changes is not the be-all and the end-all, but only the first step. Once you’re done with it, you will need to test your merge and fix up any problems. But this isn’t a big deal — it’s the kind of thing you’re doing all the time in normal coding anyway.

Besides, manual resolution only gives you a narrow view of what you’re doing. It’s only when you compile and test that you really see how the two sides of the merge fit together and get a feel for how to deal with the context and intent of the two sides of the merge.

Most problems with merges show up when you attempt to compile your code. In these cases, it’s merely a case of fixing them up — clearing up ambiguous references, checking renames and so on. If you have good test coverage (and you should have good test coverage), your unit tests will pick up the majority of other problems, though you do need to be aware that incorrect merges may have an impact on your tests too. And while some problems may slip through the net, they generally are pretty insignificant in number and scope compared to bugs that creep in through normal, everyday coding.

Fully manual merge resolution is helpful for new users because it eases them gradually into the apparently scary world of branching and merging. But once you are used to it, it becomes apparent that there is little or no benefit to the all-manual approach. While you may feel more in charge of the process while you’re carrying it out, this is largely illusory and a waste of time, somewhat akin to premature optimisation. Provided that your tooling has decent automatic merge support — and Mercurial certainly does have decent automatic merge support — there’s every reason to make the most of it.

22
Feb

On named branches in Mercurial

There seems to be a common misconception among some Git users that in order to branch your code in Mercurial, you have to clone your repository. While some Mercurial users prefer to work that way, it isn’t actually necessary, and Mercurial does provide you with a much more lightweight alternative. The easiest way to branch your code is simply to hg update to the revision off which you wish to branch, then when you next hg commit, it will implicitly create a new branch for you. Similarly, when you hg merge, it will implicitly close the branch off. I tend to use a mixture of the two approaches, with repository clones for longer-running feature branches, and in-place branching for ad-hoc experimentation, smaller features, and the like.

A lot of confusion seems to centre round the concept of named branches though. If you’re used to the way Git works, you’d be forgiven for thinking that pulling from a remote repository would replace your “foo” branch with the incoming one, sending your work off to be garbage collected unless you merge immediately after pulling. Mercurial doesn’t actually work that way — what you get is two parallel branches, both called “foo”, which you can then merge, rebase or strip out as appropriate. This is because Mercurial tends to view the DAG as more immutable than Git does, and if you want to remove branches that are no longer needed, you do it explicitly using hg strip (a part of the Mercurial Queues extension).
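A toy model may make the "two parallel branches, both called foo" idea more concrete. The sketch below is in Python and is only an approximation of how Mercurial thinks about heads: the `Changeset` class and the `heads` function are hypothetical, and the revision numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Changeset:
    rev: int
    branch: str
    parents: tuple = ()


def heads(changesets, branch):
    """Revisions on `branch` that have no child on the same branch."""
    has_child = set()
    for cs in changesets:
        if cs.branch == branch:
            has_child.update(cs.parents)
    return sorted(cs.rev for cs in changesets
                  if cs.branch == branch and cs.rev not in has_child)


# Local history: three changesets in a row on "foo".
repo = [Changeset(0, "foo"), Changeset(1, "foo", (0,)), Changeset(2, "foo", (1,))]
print(heads(repo, "foo"))

# Pulling someone else's work based on rev 1 adds a second head.
# Nothing is replaced and nothing is garbage collected.
repo.append(Changeset(3, "foo", (1,)))
print(heads(repo, "foo"))

# Merging the two heads leaves a single head again.
repo.append(Changeset(4, "foo", (2, 3)))
print(heads(repo, "foo"))
```

In this model the branch name is just a label on each changeset, which is why two heads can carry the same name without either one displacing the other.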

For what it’s worth, I don’t like the way Mercurial uses the word “branch” here, since it doesn’t accurately reflect what you expect the word “branch” to mean: a single code line where every node in the DAG has exactly one parent and exactly one child. It seems to me that it’s something of a leftover from centralised, line-based tools such as Subversion and Perforce, where every branch has to have a name because of the need to place it somewhere in the file system.

But I don’t find it a big deal. I find the best way to handle branching and merging in Mercurial is to view your branches as essentially anonymous. Branch names, tags and bookmarks then become purely a documentation layer added on top of the DAG. I personally view branch names in particular as largely vestigial and almost never use them: I always commit exclusively to default, and generally recommend that others do the same unless they have a valid use case for named branches. If you need to keep track of which head is which, the bookmarks extension provides similar functionality to Git branches, and is far less confusing.

Incidentally, one DVCS that does seem to require you to clone your repository in order to create a new branch is Bazaar. I’ve spent a few hours tinkering with Bazaar on and off over the past few months and I haven’t yet been able to find a way to branch in-place similar to hg update/edit/hg commit or git branch. Perhaps someone could enlighten me?

14
Feb

Team Foundation Server is the Lotus Notes of version control tools

tl;dr: Advocates of Team Foundation Server, Microsoft’s ALM suite, respond to criticism by saying that TFS is not just source control but an end-to-end integrated ALM suite. This completely misses the point of our criticism of TFS in the first place: that we find it restrictive, bureaucratic, unreliable, and extremely difficult to use. End to end integration does not justify unreliability or a poor user experience.

About a year ago, Martin Fowler conducted a survey of ThoughtWorks developers to find out what they thought about various source control tools. Not surprisingly, Git came out top. The one that came out bottom? Team Foundation Server. In fact, TFS was unique in getting no positive responses at all: out of 54 respondents who had used it, every single one of them rated it as either “problematic” or “dangerous.”

Team Foundation Server advocates claim it’s unfair to compare TFS to other source control tools, since it’s not just source control, but an integrated end-to-end application lifecycle management solution. Comparing TFS to, say, Subversion, is like comparing Microsoft Office to Notepad, so they say.

Now where have I heard something like that before? Oh yes, Lotus Notes:

The main focus for frustration is Notes’s odd way with email, and its unintuitive interface. But to complain about that is to miss the point, says Ben Rose, founder and leader of the UK Notes User Group (www.lnug.org.uk). He’s a Notes administrator, for “a large automotive group”.

“It’s regarded by many as an email program, but it’s actually groupware,” Rose explains. “It does do email, and calendaring, but can host discussion forums, and the collaboration can extend to long-distance reporting. It will integrate at the back end with huge systems. It’s extremely powerful.”

The thing is, it wasn’t the detractors who were missing the point. It was the Lotus Notes guys. You see, e-mail is right at the heart of any groupware application. It’s the part of the application that users interact with the most. It’s where usability matters the most. And it’s what Notes got wrong the most.

It’s exactly the same with ALM tools. Source control is the part of your ALM tool that is most visible to developers. It’s source control rather than, say, work item tracking or continuous integration, that can make or break your workflow. It is source control where a zero-friction experience is most important.

Team Foundation Server is not zero-friction. Not by a long shot.

I guess if you have only ever used TFS, Visual SourceSafe, and perhaps exclusively trunk-based development in merge-paranoid Subversion teams that use if statements and configuration settings to avoid branching, you would be happy enough with it, since that’s all that you know source control to be capable of. But once you’ve actually used one of the alternatives that offers you fluent, unrestricted branching and merging, a local sandbox, flexible workflows, self-consistent best practices, and source control as an extension of your undo button, the limitations of TFS become so massive that it’s not even funny any more. (Incidentally, if you tot up the figures in Fowler’s survey, you’ll find that his respondents had, on average, experience with six different tools.)

But even if you’ve never used a DVCS and are only comparing it to Subversion, it’s still a usability disaster. Subversion may have pitfalls and gotchas and limitations of its own, but once you know your way around it, you can at least work as fluently with it as is possible with a primarily trunk-based, centralised tool. In TFS, even the simplest tasks become Herculean undertakings. How do you back out a changeset that isn’t the latest, for instance? Why can’t I have a check-in screen that shows only the files that have actually changed since my last commit? Why does it take me half a dozen mouse clicks for each file in my check-in screen to find out that it doesn’t have any changes? Why is it asking me to check in files that don’t have any changes in the first place? Why does it turn Visual Studio into a Berlin Wall around my code with these awful read-only files? Why does it lobotomise the branching and merging experience with baseless merges, making feature branches — pretty much a must-have for a pain-free ALM experience these days — impractical for all but the largest tasks? Why can’t it cache my login credentials to a server on a different domain like Subversion does? Why does the command line interface bring up dialog boxes? Is it a command line interface or isn’t it? And that’s barely scratching the surface of its usability problems. It doesn’t even have a search tool to speak of.

Furthermore, the source control component is the one part of TFS that you can’t swap out for something else. You can use TFS source control with Trac, Mantis, FogBugz or Jira, or with TeamCity or FinalBuilder, but you can’t use TFS work items or TFS build servers with Subversion, or Git, or Mercurial. As far as TFS is concerned, source control is their way or the highway.

End-to-end integration is all very well, but it is hardly a killer feature, and when the most visible component that it integrates is difficult to use and gets in the way, it ceases to be an asset and it becomes a liability. It’s far better to have a selection of separate tools, each of which is designed to do its job well, than a single monolithic application that does everything badly.

07
Feb

How often should you check in code?

There’s a lot of confusion among developers about how often to check in code to source control. Many projects have histories riddled with huge commits making sweeping changes to dozens of files, often with only a vague commit summary or even no commit summary at all. Those projects that have guidelines and policies in place usually don’t have a clear justification for those policies, and some of them are downright unhelpful, such as, “at least once a day,” or “whenever you come to a natural break in your workflow, such as lunchtime.”

The problem is that if you’re all doing everything in a single branch, typically trunk, it is not possible to come up with a straight answer to the question.

Should you “check in early, check in often,” which Jeff Atwood once described as the golden rule of source control? This ensures that you never lose much code, you keep up to date with everyone else, and you don’t go dark. However, if you’re all working on different tasks on the same branch, you will end up with two sets of unrelated revisions tangled up together in your history, and if one needs to go live, like, yesterday, and the other has had to be put on hold for any reason, as happened to us at the end of our last sprint, you’ll run into difficulties.

Alternatively, you could check in only completed units of work. However, this causes other problems. Deferring check-in until a unit of work is complete often results in huge, monolithic commits that increase the risk of integration conflicts. Furthermore, if you get into a mess attempting to resolve said integration conflicts, there is no way to back out to where you were before you ran svn update. I’ve had colleagues in this situation end up with no option but to roll back to the latest revision in source control, losing days of work that only existed in their working copy in the process.

Furthermore, large, monolithic commits are impossible to describe comprehensively and accurately in a commit summary, and they cause problems when carrying out a binary search of your history for the revision that introduced a bug.

Of course, you should be dividing your work up into smaller units as much as possible anyway to minimise the risk of this happening, but this isn’t always possible. Whichever of the two options you choose, you’re going to run into problems sooner or later.

Having a separate branch for each feature resolves this dilemma neatly. This sounds scary at first if you aren’t used to branching and merging, but providing your tooling supports it, it isn’t as bad as it sounds, since feature branches are usually fairly short, so you don’t get as many Big Scary Merges as you would expect. Besides, even when you do get a Big Scary Merge, it’s better than an otherwise identical Big Scary Commit, because if your attempts to resolve the conflicts go wrong, you can at least roll back to what you had before you attempted the merge and try again.

With that in mind, we can come up with some more sensible guidelines on how often to commit to source control.

1. Every commit should serve one, and only one, purpose.

This is a straightforward corollary to the Single Responsibility Principle. If you have to use the word “and” or “also” in your commit summary, you’re probably checking in too much.
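The “and” or “also” test is simple enough to automate as a rough check. The Python sketch below is a heuristic only, and the function name `mixes_purposes` is hypothetical; it could be called from whatever commit hook mechanism your tool provides, but that wiring is not shown here.

```python
import re


def mixes_purposes(summary):
    """Flag summaries that contain 'and' or 'also' as whole words."""
    return re.search(r"\b(and|also)\b", summary, flags=re.IGNORECASE) is not None


# Two purposes in one commit: worth splitting up.
print(mixes_purposes("Fix login redirect and update README"))
# "handler" contains the letters "and" but not as a separate word.
print(mixes_purposes("Rename request handler"))
```

A check like this will produce false positives (“Add drag and drop support” is one purpose), so it is better used as a prompt to think again than as a hard rule.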

If you have two unrelated changes in your working copy, you need to break them up. This is called the “tangled working copy problem,” and modern SCMs give you tools to sort it out. If you’re using Subversion or TFS, on the other hand, well, you should have been more careful. Unfortunately, a tangled working copy can be pretty hard, or in some cases even impossible, to avoid.

Needless to say, you should never check in code to two separate branches, let alone to two separate products, in a single commit, even if your source control allows you to do so.

2. Every commit should be small enough to be described in detail in the summary.

Your commit message won’t necessarily cover every last line of code in your change. If you’ve added a whole bunch of stuff, as long as it’s reasonably self-explanatory and isn’t riddled with meaningless method names such as doIt(), a single line commit message may suffice. But the combination of your code and your commit message should explain every line that has changed. And if you’ve removed or edited existing code, that will all need explaining in your commit summary too, particularly if it’s counterintuitive or at first sight could be mistaken for a bad practice, such as changing an encoding from UTF-8 to 7-bit ASCII.

If your commit is too large to make this practical, your commit is too large, period.

3. Every commit should build and (usually) pass all your unit tests.

Some DVCS users may disagree with me on this one, insisting that you can use your local history as a sandbox for your commits, so it doesn’t matter, but I stand by it. Broken builds have to be marked as untestable by your bisect tool, which complicates pinpointing the change that introduced the bug. A string of broken builds in succession makes matters worse. Besides, both Git and Mercurial provide mechanisms to allow you to resolve this situation by combining breaking changesets with ones that fix them — namely, interactive rebase (or git commit --amend) and Mercurial Queues respectively.

The only exception to the rule that every commit should pass your unit tests is when you are working in a test-driven manner, where you write a failing test then write code to make it pass. Here, you may want to consider checking in the new test separately from the code to fulfil its requirements, in order to audit just how test-driven your development really is.

4. Use feature branches liberally, and merge to your main development branch only when the task is complete.

This guideline is a more sensible version of “check in only completed units of work.” Single-responsibility, easily describable commits are obviously fairly small and frequent (a few lines of code, representing less than an hour’s work), and usually do not represent a completed unit of work.

That’s why feature branches are so important if you are to observe best practices with source control. In this case, “check in only completed units of work” becomes “integrate only completed units of work,” and the conflict between the two different best practices is thereby resolved. When you merge, always say what you are merging, with an issue number in your bug tracker where appropriate. Don’t just write “Merge.”

In an ideal world, every feature should be developed on a separate branch. With a modern DVCS, this is of course the default, and very easy. With centralised source control, however, it can take considerably more effort depending on your tool and your project setup, but it is by no means impossible. In cases such as these, you may need to make some compromises, and decide on a threshold above which to create a feature branch. But in general, it’s best to keep this threshold as low as you can get away with, or possibly even lower it gradually as you and your team-mates become more confident with branching and merging. Certainly, if you’re doing exclusively trunk-based development, you’re denying yourself a straight answer to the question of how often to check in code, and asking for problems sooner or later. Whatever SCM tool you are using, if you don’t know how to branch and merge with it, you should learn how to do so.

21
Dec

Finding bugs with a binary search of your source control history

Mercurial’s bisect command is a fantastically useful tool when you’re faced with a bug.

It’s a very simple idea. You start off with your latest revision, which you know has the bug, go back to a revision that you know didn’t have the bug, and do a binary search until you find the revision that introduced it.

So let’s say your latest revision was number 500. You’d mark that one as bad, then test, say, revision 100, find that it works as expected, and mark that as your last known good revision. Mercurial will then automatically update to revision number 300 (halfway in between) for you to test. Mark as good or bad as appropriate, lather, rinse and repeat until you find the change that introduced the bug.

With every test that you make, the difference between the “good” and “bad” revisions decreases by a half, quickly narrowing the gap:

Bisect narrowing the range between the last known good and first known bad revisions

Consequently you will be able to pinpoint the breaking change after approximately log₂ n tests, where n is the number of revisions between your last known good and first known bad revisions, so a thousand revisions would only take one more test than 500, and a million would only take one more test than 500,000. Once you’ve found the offending change, you can very easily zoom right in on the problematic lines of code, rather than having to spend ages stepping through it all in the debugger.
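The search itself can be sketched in a few lines of Python. This is a simplified illustration that assumes a purely linear history with integer revision numbers; it is not how hg bisect is implemented, which has to cope with a branching DAG. The function name `bisect_revisions` and the revision where the bug appears (342) are made up for the example.

```python
def bisect_revisions(good, bad, is_bad):
    """Binary search for the first bad revision on a linear history.

    `good` is a revision known to work and `bad` one known to be broken.
    `is_bad(rev)` stands in for checking out `rev` and testing it by hand.
    Returns the first bad revision and the list of revisions tested.
    """
    probes = []
    while bad - good > 1:
        mid = (good + bad) // 2   # halfway between the two known points
        probes.append(mid)
        if is_bad(mid):
            bad = mid             # the bug was introduced at or before mid
        else:
            good = mid            # the bug was introduced after mid
    return bad, probes


# Revision 500 is broken and revision 100 works, as in the example above.
first_bad, probes = bisect_revisions(100, 500, lambda rev: rev >= 342)
print(first_bad, probes)
```

The first revision tested is 300, halfway between 100 and 500, and each further test halves the remaining range until only one candidate is left.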

You don’t need to be using Mercurial to apply this technique. You can do it manually with any version control tool, though you will need to keep a manual note of what’s what if it doesn’t provide you with the necessary tools to do it. It can also be pretty slow with centralised tools, since you have to hit the network for every test.

There are a couple of points to note with this procedure, however.

First, bisect is most effective when your revisions are small and serve a single purpose. If the breaking revision changes a lot of code, and tackles too many things at once, it may be difficult to identify the source of the problem once you have located the offending change. This is why it is important to “check in early, check in often.” This is also why good, informative commit summaries are important.

Second, remember that you’re looking for the revision that introduced a specific bug. If a revision does not have this specific bug but has other problems, you should mark it as good nonetheless.

Revisions that don’t compile, or have other problems that prevent you from determining whether the bug exists in the first place, should not be marked as either “good” or “bad” but should be flagged to be skipped. In this case, your “last known good” and “first known bad” revisions won’t be updated, and the number of tests you have to make will increase, slowing down your search. Consequently it is good practice to ensure that every commit that you make to source control builds correctly and ideally also passes all your unit tests. When you’re using a DVCS it can be tempting to disregard this altogether, but if hg bisect reports that your error is somewhere in a string of twenty successive revisions, none of which compiles, you’ll have more of a headache sorting out what’s what. Certainly, broken check-ins should be very much the exception rather than the rule.

04
Oct

Perforce Merge: a very nice free replacement for TortoiseMerge

No matter which source control tool you’re using, sooner or later you’ll encounter a merge conflict. When this happens, a decent graphical merge tool is a must-have.

There are two different types of merge tools. Two-way merge tools show you your version of the file and the other person’s version of the file side by side. Three-way merge tools also show you the original file in the middle. This helps clear up a lot of confusion, since you can see what the original file looked like before anyone did anything to it.

So far, I’ve been using TortoiseMerge as my merge tool of choice, since it comes with TortoiseSVN, it’s familiar, it’s reasonably usable, and it is not too ugly. The only downside is that it’s two-way, rather than three-way. TortoiseHg gives you kdiff3 by default instead, which is a three-way merge tool, but it’s an absolute eyesore and its usability leaves a lot to be desired. Up to now, I’ve always switched it out in favour of TortoiseMerge.

Recently I came across the Perforce merge tool P4Merge (hat tip: Novaleaf Game Studios) and I must say that I’m impressed. It gives a very clear, intuitive view of what’s changed, with a text editor underneath that lets you resolve the conflicts easily. The icons to the right hand side of the text editor allow you to select which version you want to cherry pick. Oh, and visually, it looks fantastic.

Perforce merge tool in action

P4Merge comes with the Perforce client tools which are a free download: if you’re not using Perforce itself for source control, select only the merge tool on the installation wizard and deselect everything else.

Once you’ve installed P4Merge, TortoiseHg will automatically detect it and list it as an option in the TortoiseHg configuration dialog or merge wizard. If you’re using Subversion or Git with their respective Tortoises, you need to specify the command line in the options dialog: “Using a cool merge tool with SVN or GIT” tells you how. Team Foundation Server is somewhat more complicated, but still doable: “Using P4Merge with Visual Studio 2008 and TFS” explains how to tackle it.

The only downside compared with TortoiseMerge is that the option to cherry-pick changes only works at the block level, rather than line by line. However, since the resolution panel at the bottom is of course a free-form text editor, you can easily copy and paste as necessary, so this is no big deal. I think I’ll be using it as my merge tool of choice from now on.