Building Diff Is Way Harder Than It Looks

Harshit SharmaMarch 31, 2026

6 min read

I tried to build diff. Compare two arrays, find what's different, done. Felt like a weekend project.

It took me down a rabbit hole that ended at graph theory. Turns out the algorithm behind git diff is solving a shortest-path problem on every commit, and the reason it works is not at all obvious. I figured I'd write up the journey.

It starts with a loop

Two identical sequences:

A: a b c
B: a b c

Nothing changed. Return nothing. Now change one element:

A: a b c
B: a x c

One line is different. Compare index by index, mark the change:

 a
-b
+x
 c

At this point the whole problem feels like a for loop:

for (let i = 0; i < Math.max(A.length, B.length); i++) {
  if (A[i] !== B[i]) {
    // mark as changed
  }
}

This works. Until it doesn't.

The first crack

Consider this:

A: a b c
B: a x b c

One line was inserted. That's it. But the loop sees something else entirely:

a == a  ✓
b != x  ✗
c != b  ✗
  != c  ✗

Everything after the insertion looks wrong, even though b and c didn't change at all. They shifted down by one position. Toggle between the two views to see the difference:

Naive loop - comparing A[i] vs B[i]

[0]a==a

[1]b!=x

[2]c!=b

—!=c

3 out of 4 comparisons fail, but only 1 thing actually changed

After an insertion, A[i] and B[i] no longer refer to the same logical line. The comparison becomes meaningless. Diff isn't about comparing positions. It's about finding alignment.

Trying to fix it

The natural instinct is to look ahead. If A[i] doesn't match B[i], maybe it matches B[i + 1]:

if (A[i] === B[i + 1]) {
  // probably an insertion in B
}
if (A[i + 1] === B[i]) {
  // probably a deletion from A
}

This handles one insertion. For a moment it feels like you've solved it.

Then you try two insertions:

A: a b c d
B: a x y b c d

Breaks again. Looking one step ahead can't handle a two-element shift. The naive loop reports 5 mismatches when there are only 2 actual changes:

Naive loop - comparing A[i] vs B[i]

[0]a==a

[1]b!=x

[2]c!=y

[3]d!=b

—!=c

—!=d

5 out of 6 comparisons fail, but only 2 things actually changed

You could look two steps ahead. Three. Write a loop that scans forward until it finds a match. But now you have a new problem: at every mismatch, both A[i] === B[i+n] (insertion) and A[i+n] === B[i] (deletion) might look valid. Which branch do you take?

Try both. But each branch can hit another ambiguity. And each of those branches can split again. You're building a tree of possibilities that grows exponentially. For two 1000-line files, you'd be there forever.

The problem isn't the lookahead distance. It's that you're guessing locally when the answer is global.

Reframing the problem

Here's the mental shift that makes everything else possible.

Stop asking "what changed?" and ask this instead:

How do I transform A into B using the fewest operations?

Three operations: insert, delete, and match. Matching is free. Inserts and deletes cost one each.

This turns a vague "find the differences" task into a concrete optimization problem: find the cheapest sequence of operations that converts A into B.

Turn it into a grid

This is where it clicked for me.

Put B on the horizontal axis, A on the vertical. You start at (0, 0), nothing consumed from either sequence. You want to reach (A.length, B.length), both fully consumed.

Three kinds of moves:

right = insert from B, costs 1
down = delete from A, costs 1
diagonal = match A[i] with B[j], free when they're equal

Start at (0, 0). Nothing consumed yet.

diagonal = match (free)

right = insert (cost 1)

down = delete (cost 1)

The best diff is the path with the most diagonals (matches) and fewest horizontal or vertical moves (edits). Shortest path from top-left to bottom-right. The whole diff problem collapses into graph traversal.

For those two sequences a b c → a x b c, the shortest path uses 3 diagonals and 1 right move. Three matches, one insert. That's the minimum diff.

A variable called k

While reading Eugene Myers' 1986 paper ("An O(ND) Difference Algorithm and Its Variations"), I hit a variable that seemed random at first:

k = x - y

It's not random. k tells you how far the two sequences are shifted relative to each other. k = 0 means perfectly aligned. Negative means B is ahead (insertions happened). Positive means A is ahead (deletions happened).

Each value of k is a diagonal line across the grid. When you match, both x and y increment, so k stays the same. You stay on the same diagonal. Only edits, inserts and deletes, move you to a different one.

Perfectly aligned, no shift between A and B

k = x - y = 0

So instead of tracking every (x, y) position in the grid, you track how far you've reached along each diagonal. One number per diagonal instead of a full coordinate pair.

The algorithm

Myers' algorithm stores a single array V where V[k] is the farthest x-position reached on diagonal k. That's the whole state. No matrix, no backtracking table. One array.

It works in rounds. Round d asks: "can I reach the end with exactly d edits?"

Each round does three things. For each reachable diagonal, extend from a neighboring diagonal (one insert or delete). Then slide along the diagonal as far as matching allows. Then check if any diagonal reached the end.

Start with d = 0 (pure matches, no edits). If that doesn't reach the end, try d = 1. Then d = 2. The first round that hits the bottom-right corner gives you the minimum edit distance.

Begin at (0, 0). Nothing consumed from either sequence yet.

Start

You never guess. You expand the frontier one edit at a time, keep only the best progress per diagonal, and stop the moment you're done. For two sequences of length N with D differences, this runs in O(N * D) time. When the differences are small, and they usually are, that's the whole point of diffing, it's fast.

One thing the visualizer doesn't show: to get the actual +/- output, you save a snapshot of V after each round. Once you reach the end, you walk backwards through those snapshots to reconstruct which moves you took. That's how you go from "edit distance is 1" to an actual diff.

From algorithm to `git diff`

Myers gives you the core: minimum edit distance on two sequences. But git diff operates on files with thousands of lines, and it does more than just run Myers raw.

Lines get hashed to integers before comparison, so the algorithm compares hashes instead of doing string equality on every line. Git also supports alternative strategies: patience diff (anchors on unique matching lines first, then runs Myers on the gaps) and histogram diff (a faster variant of patience). And rename detection is a separate pass entirely, matching files across renames by content similarity.

Myers is the foundation. The rest is engineering to make it practical at scale.

Why this matters

git diff uses Myers' algorithm. So does GNU diff and most diff tools you've used. Every green and red line in every pull request you've reviewed runs through some version of this.

I went in thinking I'd write a loop. I came out thinking about edit graphs and shortest paths. The thing that surprised me most: a good diff algorithm doesn't look for what changed. It looks for what stayed the same, and treats everything else as cost.

If you want to go deeper: Myers' original paper is freely available and readable. The diff-match-patch library by Google is a clean reference implementation. And Git's own xdiff implementation shows how this works at scale.

Try it yourself

Type two short strings and watch the edit graph and shortest path update live. The colored path shows how the algorithm transforms A into B.

Type two strings and see the shortest edit path computed live.

a+xbc

edit distance: 13 matches

Written by Harshit Sharma. If you want to know when new posts are out, follow me on Twitter.