Benchmarks

Real-world and synthetic merge benchmarks. Reproduce with weave bench-repo <path>.

Methodology

How the benchmarks work.

1. Clone a real repo. We pick major open-source repos with long merge histories: git/git (C), Flask (Python), CPython (C/Python), Go (Go), TypeScript (TS).
2. Walk merge commits. For each merge commit with two parents, extract the base (merge-base), ours (parent 1), theirs (parent 2), and the human result (the merge commit itself).
3. Replay each file merge. For every file that both parents touched, run git's line-level merge and weave's entity-level merge on the same (base, ours, theirs) triple.
4. Compare against human. A win is when git conflicts but weave resolves cleanly. A regression is when git resolves cleanly but weave's output differs from the human result. Human match checks whether weave's output is identical to what the developer wrote.

Summary

Across 4,517 file merges from 5 repos, weave resolves 83 merges that git cannot, with 0 regressions on C, Python, and Go.

Repository   Language     Files Tested   Both Clean   Weave Wins   Both Conflict   Regressions   Human Match
git/git      C                   1,319        1,009           39             271             0           64%
Flask        Python                 56           30           14              12             0           57%
CPython      C / Python            256          201            7              48             0           29%
Go           Go                  1,247        1,000           19             228             0           58%
TypeScript   TypeScript          1,639        1,340            4             292             3           75%

git/git

The git source code itself. 1,319 file merges from 500 merge commits. Mostly C header and source files.

Wins: 39 · Regressions: 0 · Human Match: 64% · Resolution Rate: 13%

25 of 39 wins produce output identical to the human merge. The remaining 14 differ in entity ordering (e.g. weave places a struct above a function where the human placed it below). These are stylistic differences, not semantic errors.

Common win patterns: both branches add different extern declarations to a header, both branches add functions to different sections of a .c file, import block changes that git sees as overlapping lines.
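The header pattern above is easy to reproduce with git's own line-level merge driver (this sketch assumes `git` is on PATH; the file contents and declaration names are invented for illustration). Both branches append a different extern declaration at the same spot, and `git merge-file` reports a conflict even though the edits do not semantically collide.

```python
import pathlib
import subprocess
import tempfile

def write(dir_: str, name: str, text: str) -> str:
    """Write a temp file and return its path."""
    p = pathlib.Path(dir_) / name
    p.write_text(text)
    return str(p)

base = ("#ifndef CACHE_H\n"
        "#define CACHE_H\n"
        "extern int existing_fn(void);\n"
        "#endif\n")
# Each branch inserts a different declaration before #endif.
ours = base.replace("#endif", "extern int ours_fn(void);\n#endif")
theirs = base.replace("#endif", "extern int theirs_fn(void);\n#endif")

with tempfile.TemporaryDirectory() as d:
    # git merge-file -p <ours> <base> <theirs> prints the merge result;
    # the exit status is the number of conflicts.
    result = subprocess.run(
        ["git", "merge-file", "-p",
         write(d, "ours.h", ours),
         write(d, "base.h", base),
         write(d, "theirs.h", theirs)],
        capture_output=True, text=True)

merged = result.stdout
# Line-level merge sees two edits to the same region and conflicts;
# an entity-level merge can keep both declarations.
print(merged)
```

Running this prints conflict markers around the two declarations: exactly the kind of false conflict the wins above eliminate.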

Flask

Python web framework. 56 file merges from 500 merge commits. Highest resolution rate of all tested repos.

Wins: 14 · Regressions: 0 · Human Match: 57% · Resolution Rate: 54%

Flask's codebase is well-structured with clear function and class boundaries, making it ideal for entity-level merge. Over half of all git conflicts are resolved by weave. Common patterns: both branches modifying different methods in app.py, import additions to __init__.py.

CPython

The Python interpreter. 256 file merges from 500 merge commits. Mix of C source and Python test files.

Wins: 7 · Regressions: 0 · Human Match: 29% · Resolution Rate: 13%

The lower human match rate stems from CPython's heavy use of macros and preprocessor directives in C code, which create entity ordering differences. The wins themselves are clean: header file declarations and test method additions that git falsely flags as conflicts.

Go

The Go compiler and standard library. 1,247 file merges from 500 merge commits.

Wins: 19 · Regressions: 0 · Human Match: 58% · Resolution Rate: 8%

Go's explicit structure (top-level functions, clear type declarations) works well with entity-level merge. 58% human match rate. Common patterns: both branches adding different functions, struct field additions in different types.

TypeScript

The TypeScript compiler. 1,639 file merges from 500 merge commits. Highest human match rate but 3 regressions.

Wins: 4 · Regressions: 3 · Human Match: 75% · Resolution Rate: 1%

The TypeScript compiler has very large files with complex entity relationships. The 3 regressions are under investigation. The 75% human match rate (the highest of all repos) shows that when weave does resolve a conflict, it closely matches developer intent.

Glossary

What the numbers mean.

Files Tested: Number of individual file merges where both branches touched the same file (both-touched files across all merge commits).
Both Clean: Both git and weave merged cleanly. No conflict from either tool.
Win: Git produced a conflict, but weave resolved cleanly. A false conflict eliminated.
Both Conflict: Both git and weave produced conflicts. A real semantic collision that requires human judgment.
Regression: Git merged cleanly, but weave produced a different result than the human. Weave introduced an error where git was fine.
Human Match: Of the wins, how many produce output identical to what the developer actually wrote. Higher means weave's merge matches human intent.
Resolution Rate: Wins / (Wins + Both Conflict). What percentage of git's conflicts weave eliminates.
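As a concrete check, the two derived percentages can be recomputed from the raw counts in the summary table (the function names here are illustrative):

```python
def resolution_rate(wins: int, both_conflict: int) -> int:
    """Wins / (Wins + Both Conflict), as a rounded percentage."""
    return round(100 * wins / (wins + both_conflict))

def human_match(identical_wins: int, wins: int) -> int:
    """Share of wins whose output is byte-identical to the human merge."""
    return round(100 * identical_wins / wins)

# git/git row: 39 wins, 271 both-conflict, 25 wins identical to the human merge
print(resolution_rate(39, 271))  # -> 13
print(human_match(25, 39))       # -> 64
```

Applying `resolution_rate` to each row reproduces the per-repo stat cards, e.g. Flask's 14 wins against 12 both-conflict cases gives 54%.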

Reproduce

Run the benchmarks yourself.

run benchmarks
# Clone a repo
$ git clone --bare https://github.com/git/git /tmp/git-bench

# Run benchmark (scans up to 500 merge commits)
$ weave bench-repo /tmp/git-bench

# Show diffs for non-matching cases
$ weave bench-repo /tmp/git-bench --show-diff

# Save base/ours/theirs/human/weave for each case
$ weave bench-repo /tmp/git-bench --save benchmarks/git