sem — Agent Accuracy Benchmark

Agent accuracy

Claude Sonnet 4.5 answers the same questions about code changes in two configurations: one receives sem diff JSON, the other receives the raw git diff. Answers are scored against ground truth across 3 commits (small, medium, large).

[Chart: sem diff vs. git diff scores on Q1: List added functions (F1), Q2: Files with modified entities (F1), Q3: Entity type counts (accuracy), Q4: Added/modified/deleted counts (exact), and the overall average.]

Why git diff fails agents

Tested on 3 commits from this repo. Each exposes a different failure mode when agents reason about raw line diffs.

Line/entity confusion — Q4 on all commits

git diff has no concept of "entity." When asked to count added entities, the model counts + lines instead. On the speed optimization commit (fffb38f), git-based Claude reported 238 added, which is the number of + lines in the diff; the actual count is 32 added entities. On the Rust rewrite (ae576ab), it reported 1,122 added against a ground truth of 259.

git-based: {"added": 238, "modified": 10, "deleted": 0}
sem-based: {"added": 32, "modified": 10, "deleted": 3} // exact match
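The gap between the two answers can be reproduced with a minimal Python sketch. It assumes the sem output is a list of records carrying the changeType field mentioned on this page; the "name" field, the sample diff, and both function names are hypothetical and only serve the illustration.

```python
def count_plus_lines(diff_text: str) -> int:
    """Count added lines in a unified diff, skipping the '+++' file headers.

    Heuristic: an added content line that itself begins with '++' would
    also be skipped.
    """
    return sum(
        1
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


def count_by_change_type(changes: list) -> dict:
    """Tally entity records by their changeType field."""
    counts = {"added": 0, "modified": 0, "deleted": 0}
    for change in changes:
        kind = change.get("changeType")
        if kind in counts:
            counts[kind] += 1
    return counts


# Hypothetical sample: one new function spread over three '+' lines.
sample_diff = """\
--- a/src/example.ts
+++ b/src/example.ts
@@ -1,1 +1,4 @@
 export function existing() {}
+export function added() {
+  return 1;
+}
"""

# Hypothetical entity records; only changeType and entityType are named on this page.
sample_changes = [
    {"name": "added", "entityType": "function", "changeType": "added"},
]

print(count_plus_lines(sample_diff))         # line-level count: 3
print(count_by_change_type(sample_changes))  # entity-level count: 1 added
```

One added function produces three + lines, which is the same kind of inflation as 238 lines versus 32 entities.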

No entity type ontology — Q3 on all commits

Line diffs carry no AST, so they expose no entity types. On commit 9f7f1c7 (7 new commands), git-based Claude returned {"file": 11}: it counted files, not entities. Truth: {"interface": 12, "function": 15, "variable": 3, "class": 1}. On the Rust rewrite (ae576ab), it found 16 functions where there are 87, and completely missed chunk (80), property (29), and impl (10).
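With typed entity records, Q3 reduces to a tally. The sketch below assumes each record has an entityType field, as named on this page. The records are synthesized from the 9f7f1c7 ground-truth counts rather than taken from real sem output, and the accuracy function is one possible definition, since this page does not say how Q3 accuracy is computed.

```python
from collections import Counter


def entity_type_counts(changes: list) -> dict:
    """Count entity records per entityType."""
    return dict(Counter(c["entityType"] for c in changes if "entityType" in c))


def type_count_accuracy(predicted: dict, truth: dict) -> float:
    """Assumed metric: fraction of ground-truth types whose count matches exactly."""
    if not truth:
        return 0.0
    hits = sum(1 for kind, n in truth.items() if predicted.get(kind) == n)
    return hits / len(truth)


# Ground truth for 9f7f1c7 as stated above.
truth = {"interface": 12, "function": 15, "variable": 3, "class": 1}

# Synthetic records built from the truth counts (not real sem output).
changes = [{"entityType": kind} for kind, n in truth.items() for _ in range(n)]

print(entity_type_counts(changes))
print(type_count_accuracy({"file": 11}, truth))  # the git-based answer scores 0.0
```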

Can't distinguish add vs modify — Q1, Q2

Modified functions show + and - hunks, same as new functions in changed files. On fffb38f, git-based Claude listed 9 "added" functions, but 4 were actually modified (detectJsonChanges, parseDiffNameStatus, detectAndGetFiles, populateContent). Only 5 of the 9 were correct, so precision dropped to 55.6%. sem tags each entity with changeType: "added" vs changeType: "modified".
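The precision figure follows from set arithmetic, sketched below in Python. The four modified function names come from this page. The five new_fn_* names are placeholders for the correctly identified functions, and the truth set is assumed to contain only those five, so the recall and F1 values this sketch returns are artifacts of that assumption and not benchmark results.

```python
def precision_recall_f1(predicted: set, truth: set) -> tuple:
    """Set-based precision, recall, and F1 for a list-style answer."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder names standing in for the 5 functions that really were added.
truly_added = {f"new_fn_{i}" for i in range(1, 6)}

# The 4 modified functions that git-based Claude mislabeled as added.
mislabeled = {
    "detectJsonChanges",
    "parseDiffNameStatus",
    "detectAndGetFiles",
    "populateContent",
}

predicted = truly_added | mislabeled  # the 9 "added" functions in the answer
precision, _, _ = precision_recall_f1(predicted, truly_added)
print(f"{precision:.1%}")  # 5 correct out of 9 listed
```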

Config file blindness — Q2, Q3

JSON/YAML/TOML changes appear as raw +/- key-value lines. The model doesn't classify these as "entities." On fffb38f, git-based Claude missed package.json and package-lock.json as containing modified entities (recall dropped to 66.7%). sem reports entityType: "property" for each changed key.
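To illustrate what treating each changed key as a property entity could look like, here is a minimal Python sketch that flattens two versions of a JSON document into dotted key paths and classifies each path. This is not sem's implementation. The flatten and diff_properties helpers, the "name" field, and the sample documents are hypothetical; lists are compared as whole values and empty nested objects are ignored.

```python
def flatten(node, prefix=""):
    """Map a nested dict to {dotted.key.path: leaf value}."""
    if isinstance(node, dict):
        items = {}
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            items.update(flatten(value, path))
        return items
    return {prefix: node}


def diff_properties(before: dict, after: dict) -> list:
    """Emit one property-level change record per added, deleted, or modified key."""
    old, new = flatten(before), flatten(after)
    changes = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            kind = "added"
        elif path not in new:
            kind = "deleted"
        elif old[path] != new[path]:
            kind = "modified"
        else:
            continue  # unchanged key, nothing to report
        changes.append({"entityType": "property", "name": path, "changeType": kind})
    return changes


# Hypothetical package.json-style snippets.
before = {"version": "1.0.0", "scripts": {"build": "tsc"}}
after = {"version": "1.1.0", "scripts": {"build": "tsc", "bench": "node bench.js"}}

for change in diff_properties(before, after):
    print(change)
```

Each record names the file-level key that changed, so a model reading it does not have to infer entities from +/- key-value lines.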

Context window pressure — Q1 on large diffs

Both tools degrade on the 3,905-line Rust rewrite (ae576ab). git diff was truncated at 100KB — the model found 25/67 added functions (37% recall). sem's stripped JSON is much more compact (no source code), so the model saw all 278 entities but still only extracted 43/67 (64% recall). sem wins, but large diffs are where attention limits hurt regardless of format.
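The recall figures and the truncation step can be checked with a short Python sketch. The truncate_utf8 helper is hypothetical, and reading 100KB as 100 * 1024 bytes is an assumption; this page does not say how the benchmark truncated its input.

```python
def truncate_utf8(text: str, max_bytes: int = 100 * 1024) -> str:
    """Cut text to at most max_bytes of UTF-8, dropping any split trailing character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    return data[:max_bytes].decode("utf-8", errors="ignore")


def recall(found: int, total: int) -> float:
    """Fraction of ground-truth items that were found."""
    return found / total if total else 0.0


print(f"git diff: {recall(25, 67):.0%}")  # 25 of 67 added functions
print(f"sem diff: {recall(43, 67):.0%}")  # 43 of 67 added functions
```

Anything past the byte limit never reaches the model, so functions added late in a large diff cannot be recalled at all, independent of attention limits.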

3 commits × 4 questions × 2 tools = 24 API calls. Claude Sonnet 4.5, temperature 0. Content fields stripped from sem JSON for fair comparison.
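The run matrix can be enumerated as a cross product, sketched here in Python. The commit hashes and question labels come from this page; the benchmark_runs function and the tuple layout are hypothetical and do not describe the actual harness.

```python
from itertools import product

COMMITS = ["fffb38f", "9f7f1c7", "ae576ab"]
QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]
TOOLS = ["sem diff", "git diff"]


def benchmark_runs() -> list:
    """Return every (commit, question, tool) combination, one per API call."""
    return list(product(COMMITS, QUESTIONS, TOOLS))


runs = benchmark_runs()
print(len(runs))  # 3 * 4 * 2 combinations
```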