Real-world and synthetic merge benchmarks. Reproduce with weave bench-repo <path>.
git merges lines. mergiraf merges tree nodes. weave merges entities.
| Scenario | git merge | mergiraf | weave merge |
|---|---|---|---|
| Two agents edit different functions | CONFLICT (adjacent lines) | auto-resolved | auto-resolved |
| One adds function, one modifies another | often conflicts | auto-resolved | auto-resolved |
| Both modify the same function identically | CONFLICT | auto-resolved | detected as identical, uses either |
| Both modify the same function differently | CONFLICT | CONFLICT | attempts 3-way merge on entity body |
| One deletes, one modifies | silent data loss possible | depends on context | modify/delete conflict reported |
| Both add functions at same position | CONFLICT | CONFLICT | auto-resolved (unordered entities) |
| Python: both add decorators to function | CONFLICT | CONFLICT | auto-resolved (decorator bundling) |
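
To make the entity-level column of the table concrete, here is a minimal Python sketch of a three-way merge keyed by entity name. It is an illustration under assumptions, not weave's implementation: `merge_entities`, the `Entities` alias, and the dict-of-strings representation are hypothetical, and where weave attempts a 3-way merge on the entity body, the sketch simply reports a conflict.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: entity name -> source text of that entity.
Entities = Dict[str, str]


def merge_entities(
    base: Entities, ours: Entities, theirs: Entities
) -> Tuple[Entities, List[str]]:
    """Three-way merge keyed by entity name. Returns (merged, conflicting names)."""
    merged: Entities = {}
    conflicts: List[str] = []
    # Visit every entity once, in first-seen order: base, then ours, then theirs.
    names = list(dict.fromkeys([*base, *ours, *theirs]))
    for name in names:
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            result = o  # untouched, identical edit, or deleted on both sides
        elif o == b:
            result = t  # only theirs changed (or added, or deleted) it
        elif t == b:
            result = o  # only ours changed (or added, or deleted) it
        else:
            conflicts.append(name)  # different edits, or modify/delete
            continue
        if result is not None:
            merged[name] = result
    return merged, conflicts


base = {"parse": "def parse(): return 1", "render": "def render(): return 2"}
ours = {**base, "parse": "def parse(): return 10", "log": "def log(): pass"}
theirs = {**base, "render": "def render(): return 20", "trace": "def trace(): pass"}

merged, conflicts = merge_entities(base, ours, theirs)
print(list(merged), conflicts)  # ['parse', 'render', 'log', 'trace'] []

# Modify/delete: ours deletes "parse" while theirs modifies it.
_, md_conflicts = merge_entities(
    base, {"render": base["render"]}, {**base, "parse": "def parse(): return 3"}
)
print(md_conflicts)  # ['parse']
```

Because entities are matched by name rather than by line position, two functions added at the same place in a file never collide in this sketch, which mirrors the "unordered entities" row above.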
How the benchmarks work.
Across 4,517 file merges from 5 repos, weave resolves 83 merges that git cannot, with 0 regressions on C, Python, and Go and 3 on TypeScript.
| Repository | Language | Files Tested | Both Clean | Weave Wins | Both Conflict | Regressions | Human Match |
|---|---|---|---|---|---|---|---|
| git/git | C | 1,319 | 1,009 | 39 | 271 | 0 | 64% |
| Flask | Python | 56 | 30 | 14 | 12 | 0 | 57% |
| CPython | C / Python | 256 | 201 | 7 | 48 | 0 | 29% |
| Go | Go | 1,247 | 1,000 | 19 | 228 | 0 | 58% |
| TypeScript | TypeScript | 1,639 | 1,340 | 4 | 292 | 3 | 75% |
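
The rows above can be cross-checked with a few lines of Python. This is a sketch that only re-adds the numbers copied from the table; the `rows` name and tuple layout are illustrative choices, and the check that every file merge lands in exactly one outcome column is an observation about these rows rather than a documented guarantee.

```python
# (files_tested, both_clean, weave_wins, both_conflict, regressions), copied from the table.
rows = {
    "git/git": (1319, 1009, 39, 271, 0),
    "Flask": (56, 30, 14, 12, 0),
    "CPython": (256, 201, 7, 48, 0),
    "Go": (1247, 1000, 19, 228, 0),
    "TypeScript": (1639, 1340, 4, 292, 3),
}

for repo, (files, clean, wins, conflict, regressions) in rows.items():
    # In these rows, the four outcome columns add up to Files Tested.
    assert clean + wins + conflict + regressions == files, repo

total_files = sum(r[0] for r in rows.values())
total_wins = sum(r[2] for r in rows.values())
total_regressions = sum(r[4] for r in rows.values())
print(total_files, total_wins, total_regressions)  # 4517 83 3
```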
31 hand-crafted merge scenarios across 7 languages. Run weave bench to reproduce.
| Scenario | weave | mergiraf | git |
|---|---|---|---|
| Different functions modified | clean | clean | clean |
| Different class methods modified | clean | clean | clean |
| Both add different imports (TS) | clean | clean | CONFLICT |
| Class: different methods among 4 | clean | clean | clean |
| One adds, other modifies | clean | clean | clean |
| Adjacent function changes | clean | clean | clean |
| Python: different class methods | clean | clean | clean |
| Python: adjacent methods (4-method class) | clean | clean | clean |
| Both add exports at end of file | clean | CONFLICT | CONFLICT |
| Reformat vs modify (whitespace-aware) | clean | clean | CONFLICT |
| Both add functions at end of file | clean | CONFLICT | CONFLICT |
| Both add methods to class at end | clean | clean | CONFLICT |
| Rust: both add different use statements | clean | clean | CONFLICT |
| Python: both add different imports | clean | clean | CONFLICT |
| Class: modify method + add new | clean | clean | clean |
| Both add functions between existing | clean | CONFLICT | CONFLICT |
| Python: both add different decorators | clean | CONFLICT | CONFLICT |
| Decorator + body change | clean | clean | clean |
| TS: class method decorators | clean | CONFLICT | CONFLICT |
| TS: interface field additions | clean | clean | CONFLICT |
| Rust: enum variant additions | clean | clean | CONFLICT |
| Java: different methods in same class | clean | clean | clean |
| Java: both add annotations | clean | clean | CONFLICT |
| C: different functions modified | clean | clean | clean |
| TS: method reorder + modification | clean | clean | clean |
| Python: both add class methods | clean | clean | CONFLICT |
| Rust: both add impl methods | clean | clean | CONFLICT |
| TS: enum modify + add variant | clean | clean | clean |
| TS: add JSDoc + modify body | clean | clean | clean |
| Rust: both add doc comments to different fns | clean | clean | clean |
| Go: both add different functions | clean | clean | CONFLICT |
weave: 31/31 clean (100%) vs mergiraf: 26/31 (84%) vs git: 15/31 (48%). Full benchmark suite runs in 11ms. Individual merges take 65-374µs. Entity extraction powered by sem-core.
The git source code itself. 1,319 file merges from 500 merge commits. Mostly C header and source files.
25 of 39 wins produce output identical to the human merge. The remaining 14 differ in entity ordering (e.g. weave places a struct above a function where the human placed it below). These are stylistic differences, not semantic errors.
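One way to check whether two merge outputs differ only in entity ordering is to compare them as multisets of entities. The sketch below assumes a toy splitter that treats blank-line-separated chunks as entities; `split_entities` and `differs_only_in_order` are hypothetical helpers, not part of weave, and real entity extraction would need a proper parser.

```python
from collections import Counter
from typing import List


def split_entities(source: str) -> List[str]:
    """Toy splitter (assumption): blank-line-separated chunks are entities."""
    chunks = [chunk.strip() for chunk in source.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


def differs_only_in_order(a: str, b: str) -> bool:
    """True when a and b contain the same entities in a different order."""
    entities_a, entities_b = split_entities(a), split_entities(b)
    return entities_a != entities_b and Counter(entities_a) == Counter(entities_b)


# Illustrative inputs: the human placed the function first, the tool the struct.
human = "void init(void);\n\nstruct opts {\n    int depth;\n};"
tool = "struct opts {\n    int depth;\n};\n\nvoid init(void);"

print(differs_only_in_order(human, tool))   # True
print(differs_only_in_order(human, human))  # False
```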
Common win patterns: both branches add different extern declarations to a header, both branches add functions to different sections of a .c file, import block changes that git sees as overlapping lines.
Python web framework. 56 file merges from 500 merge commits. Highest resolution rate of all tested repos.
Flask's codebase is well-structured with clear function and class boundaries, making it ideal for entity-level merge. Over half of all git conflicts are resolved by weave. Common patterns: both branches modifying different methods in app.py, import additions to __init__.py.
The Python interpreter. 256 file merges from 500 merge commits. Mix of C source and Python test files.
The human match rate is lower (29%) because CPython's C code makes heavy use of macros and preprocessor directives, which create entity ordering differences. The wins themselves are clean: header file declarations and test method additions on which git reports false conflicts.
The Go compiler and standard library. 1,247 file merges from 500 merge commits.
Go's explicit structure (top-level functions, clear type declarations) works well with entity-level merge. 58% human match rate. Common patterns: both branches adding different functions, struct field additions in different types.
The TypeScript compiler. 1,639 file merges from 500 merge commits. Highest human match rate but 3 regressions.
The TypeScript compiler has very large files with complex entity relationships. The 3 regressions are under investigation. The 75% human match rate (3 of 4 wins, the highest of all repos) suggests that when weave does resolve a conflict, it closely matches developer intent.
What the numbers mean.
| Term | Definition |
|---|---|
| Files Tested | Number of individual file merges where both branches touched the same file (both-touched files across all merge commits). |
| Both Clean | Both git and weave merged cleanly. No conflict from either tool. |
| Weave Wins | Git produced a conflict, but weave resolved cleanly. A false conflict eliminated. |
| Both Conflict | Both git and weave produced conflicts. A real semantic collision that requires human judgment. |
| Regressions | Git merged cleanly, but weave produced a different result than the human. Weave introduced an error where git was fine. |
| Human Match | Of the wins, the percentage whose output is identical to what the developer actually wrote. Higher means weave's merges match human intent more often. |
| Resolution Rate | Wins / (Wins + Both Conflict). The percentage of git's conflicts that weave eliminates. |
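
As a worked example of the two derived metrics, the following Python sketch applies the definitions above to counts taken from this page. The function names `resolution_rate` and `human_match` are hypothetical; only the formulas and the input numbers come from the tables and text.

```python
def resolution_rate(wins: int, both_conflict: int) -> float:
    """Wins / (Wins + Both Conflict): share of git's conflicts that weave eliminates."""
    git_conflicts = wins + both_conflict
    return wins / git_conflicts if git_conflicts else 0.0


def human_match(identical_wins: int, wins: int) -> float:
    """Share of wins whose output is identical to the human merge."""
    return identical_wins / wins if wins else 0.0


# Flask: 14 wins, 12 both-conflict.
print(f"{resolution_rate(14, 12):.1%}")   # 53.8%
# git/git: 39 wins, 271 both-conflict.
print(f"{resolution_rate(39, 271):.1%}")  # 12.6%
# git/git: 25 of 39 wins are identical to the human merge.
print(f"{human_match(25, 39):.1%}")       # 64.1%
```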
Run the benchmarks yourself.
    # Clone a repo
    $ git clone --bare https://github.com/git/git /tmp/git-bench

    # Run benchmark (scans up to 500 merge commits)
    $ weave bench-repo /tmp/git-bench

    # Show diffs for non-matching cases
    $ weave bench-repo /tmp/git-bench --show-diff

    # Save base/ours/theirs/human/weave for each case
    $ weave bench-repo /tmp/git-bench --save benchmarks/git