Benchmarks

Real-world and synthetic merge benchmarks. Reproduce with weave bench-repo <path>.

weave vs git merge

git merges lines. mergiraf merges tree nodes. weave merges entities.

Scenario | git merge | mergiraf | weave merge
Two agents edit different functions | CONFLICT (adjacent lines) | auto-resolved | auto-resolved
One adds function, one modifies another | often conflicts | auto-resolved | auto-resolved
Both modify the same function identically | CONFLICT | auto-resolved | detected as identical, uses either
Both modify the same function differently | CONFLICT | CONFLICT | attempts 3-way merge on entity body
One deletes, one modifies | silent data loss possible | depends on context | modify/delete conflict reported
Both add functions at same position | CONFLICT | CONFLICT | auto-resolved (unordered entities)
Python: both add decorators to function | CONFLICT | CONFLICT | auto-resolved (decorator bundling)
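To make the entity-level idea concrete, here is a minimal sketch of a 3-way merge over named entities. This is an illustration only, not weave's actual implementation: it assumes each file has already been parsed into a mapping from entity name (e.g. function name) to source text, and merges each entity independently.

```python
# Illustrative sketch of entity-level 3-way merge (NOT weave's real code).
# Each side of the merge is a {entity_name: body_text} dict.

def merge_entities(base, ours, theirs):
    """Merge three {name: body} dicts; return (merged, conflict_names)."""
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:                  # identical on both sides (or both deleted)
            if o is not None:
                merged[name] = o    # "detected as identical, uses either"
        elif o == b:                # only theirs changed this entity
            if t is not None:
                merged[name] = t
        elif t == b:                # only ours changed this entity
            if o is not None:
                merged[name] = o
        else:                       # both changed it differently, or
            conflicts.append(name)  # modify/delete: reported, never silent
    return merged, conflicts
```

Because entities are keyed by name rather than position, two branches that each add a new function can never collide on "the same lines", which is why the position-based conflicts in the table above disappear.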

Methodology

How the benchmarks work.

1. Clone a real repo. We pick major open-source repos with long merge histories: git/git (C), Flask (Python), CPython (C/Python), Go (Go), TypeScript (TS).
2. Walk merge commits. For each merge commit with two parents, extract the base (merge-base), ours (parent 1), theirs (parent 2), and the human result (the merge commit itself).
3. Replay each file merge. For every file that both parents touched, run git's line-level merge and weave's entity-level merge on the same (base, ours, theirs) triple.
4. Compare against human. A win is when git conflicts but weave resolves cleanly. A regression is when git resolves cleanly but weave's output differs from the human result. Human match checks whether weave's output is identical to what the developer wrote.

Summary

Across 4,517 file merges from 5 repos, weave resolves 83 merges that git cannot, with 0 regressions on C, Python, and Go.

Repository | Language | Files Tested | Both Clean | Weave Wins | Both Conflict | Regressions | Human Match
git/git | C | 1,319 | 1,009 | 39 | 271 | 0 | 64%
Flask | Python | 56 | 30 | 14 | 12 | 0 | 57%
CPython | C / Python | 256 | 201 | 7 | 48 | 0 | 29%
Go | Go | 1,247 | 1,000 | 19 | 228 | 0 | 58%
TypeScript | TypeScript | 1,639 | 1,340 | 4 | 292 | 3 | 75%
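As a sanity check, the headline totals can be recomputed from the table. A quick sketch, with the row values transcribed from above:

```python
# Per-repo rows transcribed from the summary table:
# (files tested, both clean, weave wins, both conflict, regressions)
rows = {
    "git/git":    (1319, 1009, 39, 271, 0),
    "Flask":      (56, 30, 14, 12, 0),
    "CPython":    (256, 201, 7, 48, 0),
    "Go":         (1247, 1000, 19, 228, 0),
    "TypeScript": (1639, 1340, 4, 292, 3),
}

total_files = sum(r[0] for r in rows.values())
total_wins = sum(r[2] for r in rows.values())
total_regressions = sum(r[4] for r in rows.values())

print(total_files, total_wins, total_regressions)  # 4517 83 3
```

Note that each row also balances internally: Both Clean + Weave Wins + Both Conflict + Regressions equals Files Tested for every repo.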

Synthetic benchmarks

31 hand-crafted merge scenarios across 7 languages. Run weave bench to reproduce.

Scenario | weave | mergiraf | git
Different functions modified | clean | clean | clean
Different class methods modified | clean | clean | clean
Both add different imports (TS) | clean | clean | CONFLICT
Class: different methods among 4 | clean | clean | clean
One adds, other modifies | clean | clean | clean
Adjacent function changes | clean | clean | clean
Python: different class methods | clean | clean | clean
Python: adjacent methods (4-method class) | clean | clean | clean
Both add exports at end of file | clean | CONFLICT | CONFLICT
Reformat vs modify (whitespace-aware) | clean | clean | CONFLICT
Both add functions at end of file | clean | CONFLICT | CONFLICT
Both add methods to class at end | clean | clean | CONFLICT
Rust: both add different use statements | clean | clean | CONFLICT
Python: both add different imports | clean | clean | CONFLICT
Class: modify method + add new | clean | clean | clean
Both add functions between existing | clean | CONFLICT | CONFLICT
Python: both add different decorators | clean | CONFLICT | CONFLICT
Decorator + body change | clean | clean | clean
TS: class method decorators | clean | CONFLICT | CONFLICT
TS: interface field additions | clean | clean | CONFLICT
Rust: enum variant additions | clean | clean | CONFLICT
Java: different methods in same class | clean | clean | clean
Java: both add annotations | clean | clean | CONFLICT
C: different functions modified | clean | clean | clean
TS: method reorder + modification | clean | clean | clean
Python: both add class methods | clean | clean | CONFLICT
Rust: both add impl methods | clean | clean | CONFLICT
TS: enum modify + add variant | clean | clean | clean
TS: add JSDoc + modify body | clean | clean | clean
Rust: both add doc comments to different fns | clean | clean | clean
Go: both add different functions | clean | clean | CONFLICT

weave: 31/31 clean (100%) vs mergiraf: 26/31 (83%) vs git: 15/31 (48%). Full benchmark suite runs in 11ms. Individual merges take 65-374µs. Entity extraction powered by sem-core.

git/git

The git source code itself. 1,319 file merges from 500 merge commits. Mostly C header and source files.

Wins: 39 · Regressions: 0 · Human Match: 64% · Resolution Rate: 13%

25 of 39 wins produce output identical to the human merge. The remaining 14 differ in entity ordering (e.g. weave places a struct above a function where the human placed it below). These are stylistic differences, not semantic errors.

Common win patterns: both branches add different extern declarations to a header, both branches add functions to different sections of a .c file, import block changes that git sees as overlapping lines.

Flask

Python web framework. 56 file merges from 500 merge commits. Highest resolution rate of all tested repos.

Wins: 14 · Regressions: 0 · Human Match: 57% · Resolution Rate: 54%

Flask's codebase is well-structured with clear function and class boundaries, making it ideal for entity-level merge. Over half of all git conflicts are resolved by weave. Common patterns: both branches modifying different methods in app.py, import additions to __init__.py.

CPython

The Python interpreter. 256 file merges from 500 merge commits. Mix of C source and Python test files.

Wins: 7 · Regressions: 0 · Human Match: 29% · Resolution Rate: 13%

Lower human match rate due to CPython's heavy use of macros and preprocessor directives in C code, which create entity ordering differences. The wins are clean: header file declarations and test method additions that git falsely conflicts on.

Go

The Go compiler and standard library. 1,247 file merges from 500 merge commits.

Wins: 19 · Regressions: 0 · Human Match: 58% · Resolution Rate: 8%

Go's explicit structure (top-level functions, clear type declarations) works well with entity-level merge. 58% human match rate. Common patterns: both branches adding different functions, struct field additions in different types.

TypeScript

The TypeScript compiler. 1,639 file merges from 500 merge commits. Highest human match rate but 3 regressions.

Wins: 4 · Regressions: 3 · Human Match: 75% · Resolution Rate: 1%

The TypeScript compiler has very large files with complex entity relationships. The 3 regressions are under investigation. The 75% human match rate (highest of all repos) shows that when weave does resolve, it closely matches developer intent.

Glossary

What the numbers mean.

Term | Definition
Files Tested | Number of individual file merges where both branches touched the same file (both-touched files across all merge commits).
Both Clean | Both git and weave merged cleanly. No conflict from either tool.
Win | Git produced a conflict, but weave resolved cleanly. A false conflict eliminated.
Both Conflict | Both git and weave produced conflicts. A real semantic collision that requires human judgment.
Regression | Git merged cleanly, but weave produced a different result than the human. Weave introduced an error where git was fine.
Human Match | Of the wins, how many produce output identical to what the developer actually wrote. Higher = weave's merge matches human intent.
Resolution Rate | Wins / (Wins + Both Conflict). What percentage of git's conflicts weave eliminates.
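For example, applying the Resolution Rate formula to the git/git row of the summary table (39 wins, 271 both-conflict):

```python
# Resolution Rate = Wins / (Wins + Both Conflict), per the glossary above.
wins, both_conflict = 39, 271      # git/git row
rate = wins / (wins + both_conflict)
print(f"{rate:.0%}")               # 13%, matching the git/git section
```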

Reproduce

Run the benchmarks yourself.

# Clone a repo
$ git clone --bare https://github.com/git/git /tmp/git-bench

# Run benchmark (scans up to 500 merge commits)
$ weave bench-repo /tmp/git-bench

# Show diffs for non-matching cases
$ weave bench-repo /tmp/git-bench --show-diff

# Save base/ours/theirs/human/weave for each case
$ weave bench-repo /tmp/git-bench --save benchmarks/git