Building Diff Is Way Harder Than It Looks
I tried to build diff. Compare two arrays, find what's different, done. Felt like a weekend project.
It took me down a rabbit hole that ended at graph theory. Turns out the algorithm behind git diff is solving a shortest-path problem on every commit, and the reason it works is not at all obvious. I figured I'd write up the journey.
It starts with a loop
Two identical sequences:
A: a b c
B: a b cNothing changed. Return nothing. Now change one element:
A: a b c
B: a x cOne line is different. Compare index by index, mark the change:
a
-b
+x
cAt this point the whole problem feels like a for loop:
for (let i = 0; i < Math.max(A.length, B.length); i++) {
if (A[i] !== B[i]) {
// mark as changed
}
}This works. Until it doesn't.
The first crack
Consider this:
A: a b c
B: a x b cOne line was inserted. That's it. But the loop sees something else entirely:
a == a ✓
b != x ✗
c != b ✗
!= c ✗Everything after the insertion looks wrong, even though b and c didn't change at all. They shifted down by one position. Toggle between the two views to see the difference:
After an insertion, A[i] and B[i] no longer refer to the same logical line. The comparison becomes meaningless. Diff isn't about comparing positions. It's about finding alignment.
Trying to fix it
The natural instinct is to look ahead. If A[i] doesn't match B[i], maybe it matches B[i + 1]:
if (A[i] === B[i + 1]) {
// probably an insertion in B
}
if (A[i + 1] === B[i]) {
// probably a deletion from A
}This handles one insertion. For a moment it feels like you've solved it.
Then you try two insertions:
A: a b c d
B: a x y b c dBreaks again. Looking one step ahead can't handle a two-element shift. The naive loop reports 5 mismatches when there are only 2 actual changes:
You could look two steps ahead. Three. Write a loop that scans forward until it finds a match. But now you have a new problem: at every mismatch, both A[i] === B[i+n] (insertion) and A[i+n] === B[i] (deletion) might look valid. Which branch do you take?
Try both. But each branch can hit another ambiguity. And each of those branches can split again. You're building a tree of possibilities that grows exponentially. For two 1000-line files, you'd be there forever.
The problem isn't the lookahead distance. It's that you're guessing locally when the answer is global.
Reframing the problem
Here's the mental shift that makes everything else possible.
Stop asking "what changed?" and ask this instead:
How do I transform A into B using the fewest operations?
Three operations: insert, delete, and match. Matching is free. Inserts and deletes cost one each.
This turns a vague "find the differences" task into a concrete optimization problem: find the cheapest sequence of operations that converts A into B.
Turn it into a grid
This is where it clicked for me.
Put B on the horizontal axis, A on the vertical. You start at (0, 0), nothing consumed from either sequence. You want to reach (A.length, B.length), both fully consumed.
Three kinds of moves:
- right = insert from B, costs 1
- down = delete from A, costs 1
- diagonal = match A[i] with B[j], free when they're equal
The best diff is the path with the most diagonals (matches) and fewest horizontal or vertical moves (edits). Shortest path from top-left to bottom-right. The whole diff problem collapses into graph traversal.
For those two sequences a b c → a x b c, the shortest path uses 3 diagonals and 1 right move. Three matches, one insert. That's the minimum diff.
A variable called k
While reading Eugene Myers' 1986 paper ("An O(ND) Difference Algorithm and Its Variations"), I hit a variable that seemed random at first:
k = x - yIt's not random. k tells you how far the two sequences are shifted relative to each other. k = 0 means perfectly aligned. Negative means B is ahead (insertions happened). Positive means A is ahead (deletions happened).
Each value of k is a diagonal line across the grid. When you match, both x and y increment, so k stays the same. You stay on the same diagonal. Only edits, inserts and deletes, move you to a different one.
So instead of tracking every (x, y) position in the grid, you track how far you've reached along each diagonal. One number per diagonal instead of a full coordinate pair.
The algorithm
Myers' algorithm stores a single array V where V[k] is the farthest x-position reached on diagonal k. That's the whole state. No matrix, no backtracking table. One array.
It works in rounds. Round d asks: "can I reach the end with exactly d edits?"
Each round does three things. For each reachable diagonal, extend from a neighboring diagonal (one insert or delete). Then slide along the diagonal as far as matching allows. Then check if any diagonal reached the end.
Start with d = 0 (pure matches, no edits). If that doesn't reach the end, try d = 1. Then d = 2. The first round that hits the bottom-right corner gives you the minimum edit distance.
You never guess. You expand the frontier one edit at a time, keep only the best progress per diagonal, and stop the moment you're done. For two sequences of length N with D differences, this runs in O(N * D) time. When the differences are small, and they usually are, that's the whole point of diffing, it's fast.
One thing the visualizer doesn't show: to get the actual +/- output, you save a snapshot of V after each round. Once you reach the end, you walk backwards through those snapshots to reconstruct which moves you took. That's how you go from "edit distance is 1" to an actual diff.
From algorithm to git diff
Myers gives you the core: minimum edit distance on two sequences. But git diff operates on files with thousands of lines, and it does more than just run Myers raw.
Lines get hashed to integers before comparison, so the algorithm compares hashes instead of doing string equality on every line. Git also supports alternative strategies: patience diff (anchors on unique matching lines first, then runs Myers on the gaps) and histogram diff (a faster variant of patience). And rename detection is a separate pass entirely, matching files across renames by content similarity.
Myers is the foundation. The rest is engineering to make it practical at scale.
Why this matters
git diff uses Myers' algorithm. So does GNU diff and most diff tools you've used. Every green and red line in every pull request you've reviewed runs through some version of this.
I went in thinking I'd write a loop. I came out thinking about edit graphs and shortest paths. The thing that surprised me most: a good diff algorithm doesn't look for what changed. It looks for what stayed the same, and treats everything else as cost.
If you want to go deeper: Myers' original paper is freely available and readable. The diff-match-patch library by Google is a clean reference implementation. And Git's own xdiff implementation shows how this works at scale.
Try it yourself
Type two short strings and watch the edit graph and shortest path update live. The colored path shows how the algorithm transforms A into B.
Written by Harshit Sharma. If you want to know when new posts are out, follow me on Twitter.