5

Assume that a node A in the commit tree of a codebase contains a bug, but some ancestor B of A is free of that bug. Given the topology of the commit tree [B,A] leading from B to A, can we predict c, the maximum number of steps needed to locate the commit where the bug appeared, in terms of the number of vertices, arrows, merges, cycles, or other simple graph invariants?

There are two easy cases:

  1. If the history from B to A is linear, then c is the binary logarithm of the number of nodes between B and A.

  2. If the history is totally parallel, i.e. a family of independent commits A → Mᵢ → B, then c is the number of such commits.

From these two examples, we learn that linear history leads to cheaper bisection processes, but can we be more precise than this? It is easy to write a program to compute c (which I have not done yet), but I am looking for a pretty algebraic formula. If the problem is hard, it could be interesting and easier to have the expected value of c given, for instance, the number of vertices, arrows and merges, or something similar.
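A minimal brute-force sketch of such a program could look like the following (Python; the names `ancestor_closure` and `worst_case_queries` are hypothetical), under the assumptions already made: exactly one commit in the interval introduced the bug, and a commit tests bad exactly when the culprit is that commit or one of its ancestors.

```python
# Brute-force minimax computation of c: the worst-case number of tests an
# optimal bisection strategy needs to pin down the culprit commit.
# Assumptions: exactly one candidate introduced the bug, and a commit is
# "bad" iff the culprit is the commit itself or one of its ancestors.
# All names here are hypothetical, for illustration only.
from functools import lru_cache


def ancestor_closure(parents):
    """parents[v]: direct parents of commit v.  Returns, for every commit
    reachable from the keys, the set of its ancestors including itself."""
    closure = {}

    def anc(v):
        if v not in closure:
            s = {v}
            for p in parents.get(v, ()):
                s |= anc(p)
            closure[v] = s
        return closure[v]

    for v in parents:
        anc(v)
    return closure


def worst_case_queries(candidates, ancestors):
    """Minimax over all query strategies: the best next test is the one
    minimising the worse of its two outcomes ("bad" vs "good")."""

    @lru_cache(maxsize=None)
    def c(suspects):
        if len(suspects) <= 1:
            return 0                      # culprit identified
        best = float("inf")
        for q in candidates:              # commit we could test next
            bad = frozenset(x for x in suspects if x in ancestors[q])
            good = suspects - bad
            if not bad or not good:
                continue                  # this test would teach us nothing
            best = min(best, 1 + max(c(bad), c(good)))
        return best

    return c(frozenset(candidates))


# Linear history B -> C1 -> C2 -> C3 -> A (A is the known-bad tip): c = 2.
linear = {"C1": ["B"], "C2": ["C1"], "C3": ["C2"], "A": ["C3"]}
print(worst_case_queries(frozenset(["C1", "C2", "C3", "A"]),
                         ancestor_closure(linear)))        # -> 2

# Parallel history B -> M1, M2, M3 -> A (three independent commits): c = 3.
parallel = {"M1": ["B"], "M2": ["B"], "M3": ["B"], "A": ["M1", "M2", "M3"]}
print(worst_case_queries(frozenset(["M1", "M2", "M3", "A"]),
                         ancestor_closure(parallel)))      # -> 3
```

The two checks at the end reproduce the easy cases above: about log₂ of the interval size for a chain, and one test per commit in the fully parallel case.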

Michaël Le Barbier
  • The mathy way to ask the question would be: “Given a finite ordered lattice which is an interval, we consider sets of intervals on this lattice, such that each point in the lattice can be described as an intersection of some of these intervals. What is the smallest cardinality such a set can have?” – Michaël Le Barbier Apr 26 '15 at 16:38
  • If I got this right, if you have 2 parallel commits, you may not be able to assign the bug to one of the commits - it is totally possible that the bug only results from the combination of the two. – Doc Brown Apr 26 '15 at 16:47
  • @DocBrown – This is definitely true! But let us assume that this is not the case, i.e. we consider the optimistic case where bisecting will *actually* find the bug! EDIT: Well actually, this just means that the commit we look for is *A* itself! :) – Michaël Le Barbier Apr 26 '15 at 16:50
  • Still having trouble understanding what happens with parallel commits. If we assume the bug was introduced in exactly one of the `M_i`, then you can just pick an arbitrary order for them and deal with them like in case 1. And linearization is always possible for a directed, acyclic graph. – Doc Brown Apr 26 '15 at 18:21
  • We want to find the first bug where the bug appears. In the case of parallel commits, we may need to examine each commit to conclude that the bug appeared in one *Mᵢ* or in *A*. In the case of a linear history, one can skip some commits. – Michaël Le Barbier Apr 26 '15 at 18:25
  • *And linearization is always possible for a directed, acyclic graph.* Do you refer to topological sort? This is not an option here because if I introduce an arrow in the graph, I break the implicit assumption that once the bug went in, it stays there. – Michaël Le Barbier Apr 26 '15 at 18:27
  • "We want to find the first bug where the bug appears" - (you surely meant "the first commit"). Excactly. Pick an arbitrary order, apply the bisection, it will tell you which of the M_i is the cause. What am I missing? – Doc Brown Apr 26 '15 at 18:29
  • Related: http://cstheory.stackexchange.com/questions/8923/binary-search-generalizations-for-posets/8943#8943 http://cs.stackexchange.com/questions/22451/is-the-algorithm-implemented-by-git-bisect-optimal – Winston Ewert Apr 26 '15 at 23:45
  • @DocBrown If you choose the wrong order, the bug can disappear and then reappear in the linear ordering. Bisect at its most basic works as a binary search; those intermediate "fixed" commits would make it go in the wrong direction if it landed on one of them, so I don't think the simplification works here – Izkata Apr 27 '15 at 00:16
  • @Izkata: the whole question only makes sense under the assumption that the bug does not disappear in a later commit. Without that assumption, bisection would not work even for an already linear history. Furthermore, the **definition** of commits being independent is "the order in which they are applied is irrelevant", so the outcome of the commits **cannot** depend on the order, otherwise the commits were not independent in the first place. – Doc Brown Apr 27 '15 at 06:55
  • @DocBrown This is awkward to do in comments, but if the commit tree is `A -> B -> D` and `A -> C -> D` (a diamond), and the bug is in `B` and `D`, then your arbitrary linearization cannot choose `A, B, C, D` and still work - even though `B` and `C` are still independent – Izkata Apr 27 '15 at 23:21
  • @Izkata: That depends on how commits are stored in the SCM system. If they are stored as snapshots, then your objections are valid. But if they are stored as differences, then the linearisation `A, B, C, D` would have the bug also show up in `C'` (the snapshot created by applying differences `B` and `C` to the snapshot at `A`). – Bart van Ingen Schenau Apr 28 '15 at 05:30
  • @Izkata: why not? If the bug is introduced in "commit B", it is in the codebase after you merge B into the trunk, and it stays there whether you merge C afterwards, only merge D afterwards, or both. Note that the test for the presence of a bug cannot be done by looking at a specific commit (the differences applied), but by testing the resulting code base after merging the commit into it. – Doc Brown Apr 28 '15 at 05:32
  • @BartvanIngenSchenau: you got my point, though I think it does not depend on how the SCM stores things internally. The important part is the SCM model as seen by a user of the SCM. I assume we are talking about changesets here. Each commit is a changeset, and to validate the presence of a bug one has to apply the changesets to a base revision first, then run a build step and a test which reveals the bug. And we are assuming that once a bug is introduced by a specific changeset, it stays in the resulting code. – Doc Brown Apr 28 '15 at 05:37
  • As a quick guess - if you assume you can only merge 1 branch into another (no merging of 3+ branches together), then you can easily check in at most `O(h + log(n))` where `n` is the *longest no-merges spree* in the commit tree, and `h` is the largest number of *merges* that a commit may have needed to go through from start to your known buggy build. This would be achievable by dividing your commit tree into a proper binary tree, and then pruning one branch at a time. (Note - this means that in your second case the time is from `log(n)` to `n` depending on the order of merges) – Ordous May 21 '15 at 15:48

1 Answer

3

Files in various code versioning systems typically gain bugs at one commit and remain bugged until they are fixed in another commit. This is because there are two popular methods of versioning: file locks and merges. In the case of a file lock, developers lock the files they're working on so nobody else can make changes. After the file is unlocked, the other developers must check out the updated file, ensuring that those changes won't be lost.

In a merging code versioning system, the main branch stores all of the commits from all branches merged into it, creating a linear history. This usually works on the concept of diffs/patches, meaning that a commit that introduces a bug on one branch won't be automatically wiped out by a commit from another branch that is merged in later. In some rare cases, conflicts occur, and then a manual resolution is required. Bugs usually survive such merges, too, so they still appear continuously through the master branch's timeline.

No matter how many branches or developers you have working on a project, you can practically guarantee that, simply by identifying the last known good commit before the bug and the last known bad commit (usually the current ref position), you can isolate the commit that caused the bug, no matter which branch it occurred on. This is a side effect of the linear nature of the logs and how they integrate into the main branch.

You can always determine the maximum search time for finding a bug by counting the number of commits x from the last known good to the last known bad commit and computing log2(x). You can often search an entire repository's history for a bug in well under 20 iterations (16,000 commits, for example, is just 14 iterations). If you know the release a bug appeared in, that's usually on the order of about 5-10 iterations (your mileage may vary).
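A quick sanity check of that arithmetic (a sketch assuming a linear range of commits and one test per iteration; the function name is made up):

```python
import math

def bisect_iterations(num_commits):
    """Worst-case number of bisection steps over a linear range of commits."""
    return math.ceil(math.log2(num_commits)) if num_commits > 1 else 0

print(bisect_iterations(16000))   # -> 14
print(bisect_iterations(1000))    # -> 10 (e.g. roughly a single release's range)
```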

I only have experience with SVN and Git, but it's clear that any system that uses something better than "zip all the files in the repo into a hidden commit folder" (e.g. using literal file system snapshots) should have roughly the same logarithmic performance for finding bugs. There will always be just one linear log that needs to be read, ignoring all branches, even those that were later discarded or rebased, because the timeline will always appear linear.

For a crazy example, take a look at this image:

Git Merge Hell

In this image, they show how Git works. And while Git is doing the job it was given, this is not a pretty sight. However, you'll notice how each commit lines up nicely in a linear fashion, despite merging from a ton of different branches at different times. Of course, there are tools to fix this history so it's legible again (for example, this article about how they ended up in that mess).

The point is, the versioning system will keep things straight for you, so a simple binary search in a simple linear history is really all that's required.
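To make that concrete, here is a minimal sketch (in Python, with hypothetical names) of the binary search this amounts to over a linearized history, assuming the bug never disappears once it is introduced:

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.

    commits: linear history, oldest first; commits[-1] is known bad and
    every commit before commits[0] is known good.  is_bad is the test
    (build + run) for the bug, assumed monotone: once a commit is bad,
    every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid            # the first bad commit is mid or earlier
        else:
            lo = mid + 1        # the first bad commit is after mid
    return commits[lo]

# With 16,000 commits this loop runs at most 14 times, matching the
# figure quoted above.
```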

phyrfox
  • Can you clarify “so a linear search is really all that's required”? It seems that you mean that each case can be replaced (for bisecting purposes) by an equivalent linear case, which is wrong. – Michaël Le Barbier May 21 '15 at 11:48
  • You're right. I meant to say that a simple binary search along a linear history is all that's required to find bugs. Edited. – phyrfox May 21 '15 at 13:48