Myers Gives You a Shortest Path, Not the Best One
In the last post, I walked through how Myers' algorithm works and why building diff is harder than it looks. The whole thing boils down to finding the shortest path through an edit graph.
After publishing, one question kept bugging me: what happens when there's more than one shortest path?
The illusion of one correct diff
You compare two sequences. You expect one answer. One diff. You changed these lines, here they are.
Except that's not how it works.
A: A B C B A
B: B A C A BRun Myers on this. You get a valid shortest edit path:
+B
A
-B
C
+A
B
-AEdit distance: 4. Minimal. But does it feel right?
A human looking at the same pair would probably say: "everything shifted around C." Myers doesn't see that. It sees characters, positions, costs.
Why: there are many shortest paths
The edit graph from the last post had one clear shortest path in most examples. I did that on purpose, it made the algorithm easier to explain. But in practice, the graph often has multiple paths with the same cost.
Diff solves: "find the shortest sequence of edits." When multiple sequences tie, the algorithm picks one. Myers picks whichever path its greedy diagonal-first traversal lands on first.
Toggle between the paths below. Same input, same edit distance, completely different diffs:
Why Myers picks that path
To see why, look at how Myers actually explores the edit graph.
It works in rounds. Round d = 0: can I reach the end with zero edits? No. Round d = 1: one edit? No. Round d = 2: two edits? Keep going.
This is breadth-first search. It expands outward one edit at a time, and the first path that reaches the bottom-right corner wins.
Not the cleanest path. Not the most readable path. The first one that's short enough.
Myers doesn't compare candidates. It finds one that works and stops. The path it returns depends entirely on which diagonal it happened to extend first at each round. Change the implementation slightly, reverse the scan order, break ties differently, and you get a different diff from the same input.
Diff algorithms don't understand your code
There's a second problem, separate from the BFS one.
Myers treats every line as a plain string. It sees this:
}and this:
function processPayment() {as equally important. One is a closing brace that appears fifty times in the file. The other is a unique function signature that a human would immediately anchor on.
Myers doesn't know the difference. It has no concept of functions, blocks, or logical units. A closing brace is just as good an anchor as a function declaration. So when duplicates exist, it picks whichever match it hits first during traversal.
Three different ways to match the same characters. All produce valid minimum-edit diffs. The algorithm just happened to land on one of them.
Where this actually hurts
You've probably seen something like this in a pull request:
def get_users():
users = db.query("SELECT * FROM users")
return users
def get_posts():
posts = db.query("SELECT * FROM posts")
return postsNow add a function between them:
def get_users():
users = db.query("SELECT * FROM users")
return users
def get_comments():
comments = db.query("SELECT * FROM comments")
return comments
def get_posts():
posts = db.query("SELECT * FROM posts")
return postsMyers might show the new function inserted cleanly. Or it might match return posts to return comments (similar structure), and show a messy diff where get_posts looks partially deleted and re-added.
Both valid shortest paths. The second just happens to align on the wrong duplicate lines. Same edit distance. Completely different story.
What better diffs do
The edit distance is deterministic. Given two sequences, there's exactly one minimum number of edits. That's a fact. But the actual diff output, which lines match which, is a choice. Myers makes that choice blindly.
Other strategies try to be smarter about it.
Patience diff finds lines that appear exactly once in both files. Those unique lines become fixed anchors. Then it runs Myers on the gaps between anchors. The algorithm never has to guess which return null matches which, because it only anchors on lines that aren't ambiguous.
Histogram diff works similarly but faster. It builds a frequency table of all lines and prioritizes low-frequency ones as anchors. Behaves like patience diff in practice, handles more edge cases.
Git supports all three:
git diff # default (Myers)
git diff --patience # anchor on unique lines
git diff --histogram # frequency-based anchoringTry them on a file with repeated closing braces or blank lines. The output format stays the same. What changes is alignment: which lines get matched to which.
Try it yourself
The more duplicates in your input, the more valid shortest paths exist. Try strings where every character repeats.
The real takeaway
A diff is not just a transformation. It's a story about what changed.
Myers tells the shortest story. Not necessarily the clearest one. Patience diff anchors on unique landmarks to tell a story that makes more sense to a human. Histogram diff does the same, faster. None of them are wrong. They just make different editorial choices about how to explain the same transformation.
Myers optimizes for edit distance. Humans optimize for understanding.
If you're building a diff tool, you're not just computing edits. You're picking which version of the change to show someone. The hardest part isn't finding the shortest path. It's deciding which shortest path to show.
Written by Harshit Sharma. If you want to know when new posts are out, follow me on Twitter.