Myers Gives You a Shortest Path, Not the Best One

Harshit Sharma

5 min read

In the last post, I walked through how Myers' algorithm works and why building diff is harder than it looks. The whole thing boils down to finding the shortest path through an edit graph.

After publishing, one question kept bugging me: what happens when there's more than one shortest path?

The illusion of one correct diff

You compare two sequences. You expect one answer. One diff. You changed these lines, here they are.

Except that's not how it works.

A: A B C B A
B: B A C A B

Run Myers on this. You get a valid shortest edit path:

+B
 A
-B
 C
+A
 B
-A

Edit distance: 4. Minimal. But does it feel right?

A human looking at the same pair would probably say: "everything shifted around C." Myers doesn't see that. It sees characters, positions, costs.

Why: there are many shortest paths

The edit graph from the last post had one clear shortest path in most examples. I did that on purpose, it made the algorithm easier to explain. But in practice, the graph often has multiple paths with the same cost.

Diff solves: "find the shortest sequence of edits." When multiple sequences tie, the algorithm picks one. Myers picks whichever path its greedy diagonal-first traversal lands on first.

Toggle between the paths below. Same input, same edit distance, completely different diffs:

All paths below have the same edit distance. Watch how the diff output changes anyway.
ABCBABACAB
+BA-BC+AB-A
edit distance: 4|4 valid paths found|3 matches, 4 edits

Why Myers picks that path

To see why, look at how Myers actually explores the edit graph.

It works in rounds. Round d = 0: can I reach the end with zero edits? No. Round d = 1: one edit? No. Round d = 2: two edits? Keep going.

This is breadth-first search. It expands outward one edit at a time, and the first path that reaches the bottom-right corner wins.

Not the cleanest path. Not the most readable path. The first one that's short enough.

Myers doesn't compare candidates. It finds one that works and stops. The path it returns depends entirely on which diagonal it happened to extend first at each round. Change the implementation slightly, reverse the scan order, break ties differently, and you get a different diff from the same input.

Diff algorithms don't understand your code

There's a second problem, separate from the BFS one.

Myers treats every line as a plain string. It sees this:

}

and this:

function processPayment() {

as equally important. One is a closing brace that appears fifty times in the file. The other is a unique function signature that a human would immediately anchor on.

Myers doesn't know the difference. It has no concept of functions, blocks, or logical units. A closing brace is just as good an anchor as a function declaration. So when duplicates exist, it picks whichever match it hits first during traversal.

Three ways to match the same characters. All equally valid.
ABCBABACABAB
Lines cross. The A's swap partners, C stays fixed
unique match
could go either way

Three different ways to match the same characters. All produce valid minimum-edit diffs. The algorithm just happened to land on one of them.

Where this actually hurts

You've probably seen something like this in a pull request:

def get_users():
    users = db.query("SELECT * FROM users")
    return users
 
def get_posts():
    posts = db.query("SELECT * FROM posts")
    return posts

Now add a function between them:

def get_users():
    users = db.query("SELECT * FROM users")
    return users
 
def get_comments():
    comments = db.query("SELECT * FROM comments")
    return comments
 
def get_posts():
    posts = db.query("SELECT * FROM posts")
    return posts

Myers might show the new function inserted cleanly. Or it might match return posts to return comments (similar structure), and show a messy diff where get_posts looks partially deleted and re-added.

Myers matched return posts to return comments, so get_posts looks deleted and re-added
1 def get_users():
2 users = db.query("SELECT * FROM users")
3 return users
4  
-def get_posts():
- posts = db.query("SELECT * FROM posts")
- return posts
5+def get_comments():
6+ comments = db.query("SELECT * FROM comments")
7+ return comments
8+ 
9+def get_posts():
10+ posts = db.query("SELECT * FROM posts")
11+ return posts
3 deletions + 7 insertions. get_posts looks destroyed and re-added

Both valid shortest paths. The second just happens to align on the wrong duplicate lines. Same edit distance. Completely different story.

What better diffs do

The edit distance is deterministic. Given two sequences, there's exactly one minimum number of edits. That's a fact. But the actual diff output, which lines match which, is a choice. Myers makes that choice blindly.

Other strategies try to be smarter about it.

Patience diff finds lines that appear exactly once in both files. Those unique lines become fixed anchors. Then it runs Myers on the gaps between anchors. The algorithm never has to guess which return null matches which, because it only anchors on lines that aren't ambiguous.

Histogram diff works similarly but faster. It builds a frequency table of all lines and prioritizes low-frequency ones as anchors. Behaves like patience diff in practice, handles more edge cases.

Git supports all three:

git diff                  # default (Myers)
git diff --patience       # anchor on unique lines
git diff --histogram      # frequency-based anchoring

Try them on a file with repeated closing braces or blank lines. The output format stays the same. What changes is alignment: which lines get matched to which.

Try it yourself

The more duplicates in your input, the more valid shortest paths exist. Try strings where every character repeats.

More duplicates = more ambiguity. Try strings where every character repeats.
+BA-BC+AB-A
edit distance: 44 valid paths+has duplicates

The real takeaway

A diff is not just a transformation. It's a story about what changed.

Myers tells the shortest story. Not necessarily the clearest one. Patience diff anchors on unique landmarks to tell a story that makes more sense to a human. Histogram diff does the same, faster. None of them are wrong. They just make different editorial choices about how to explain the same transformation.

Myers optimizes for edit distance. Humans optimize for understanding.

If you're building a diff tool, you're not just computing edits. You're picking which version of the change to show someone. The hardest part isn't finding the shortest path. It's deciding which shortest path to show.


Written by Harshit Sharma. If you want to know when new posts are out, follow me on Twitter.