Agent accuracy

The same questions about code changes were put to Claude Sonnet 4.5 twice: once with sem diff JSON as input, once with raw git diff. Answers were scored against ground truth across 3 commits (small, medium, large).

Chart: sem diff vs. git diff, score per question
Q1: List added functions (F1)
Q2: Files with modified entities (F1)
Q3: Entity type counts (accuracy)
Q4: Added/modified/deleted counts (exact)
Overall average

Why git diff fails agents

Tested on 3 commits from this repo. Together they expose the failure modes below, each of which appears when agents reason about raw line diffs.

Line/entity confusion — Q4 on all commits

git diff has no concept of an "entity." When asked to count added entities, the model counts + lines instead. On the speed optimization commit (fffb38f), git-based Claude reported 238 added, which is the number of + lines in the diff; the actual count is 32 added entities. On the Rust rewrite, it reported 1,122 added against a true count of 259.

git-based: {"added": 238, "modified": 10, "deleted": 0}
sem-based: {"added": 32, "modified": 10, "deleted": 3} // exact match
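The gap between the two counts can be illustrated with a minimal sketch that counts + lines the way a line-oriented reader would. The toy diff and the helper name `count_plus_lines` are illustrative assumptions, not part of the benchmark.

```python
def count_plus_lines(unified_diff: str) -> int:
    """Count added lines in a unified diff, skipping the '+++' file headers.

    Simplification: an added line whose own content begins with '++' would
    also be skipped; a real parser would track hunk boundaries instead.
    """
    return sum(
        1
        for line in unified_diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


# Toy diff (hypothetical): one new function spread over four added lines.
TOY_DIFF = """\
diff --git a/util.ts b/util.ts
--- a/util.ts
+++ b/util.ts
@@ -10,0 +11,4 @@
+export function clamp(x: number, lo: number, hi: number): number {
+  if (x < lo) return lo;
+  return x > hi ? hi : x;
+}
"""

plus_lines = count_plus_lines(TOY_DIFF)
added_entities = 1  # the single function `clamp`

print(f"+ lines: {plus_lines}, added entities: {added_entities}")
```

Even in this tiny case the line count overstates the entity count by a factor of four, and the ratio grows with function length.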

No entity type ontology — Q3 on all commits

Line diffs carry no AST information. On commit 9f7f1c7 (7 new commands), git-based Claude returned {"file": 11}: it counted files, not entities. The ground truth is {"interface": 12, "function": 15, "variable": 3, "class": 1}. On the Rust rewrite, it found 16 functions when there are 87, and missed the chunk (80), property (29), and impl (10) types entirely.
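When each change already carries a type tag, answering Q3 and Q4 reduces to a tally. The sketch below assumes a hypothetical JSON shape: only the `entityType` and `changeType` fields are named in this writeup, while the `changes` wrapper key, the `name` field, and the sample entries are invented for illustration.

```python
import json
from collections import Counter

# Hypothetical sem-style output. `changes` and `name` are assumed field names;
# the entries themselves are made up and are not benchmark data.
SEM_JSON = """
{"changes": [
  {"name": "runCommand", "entityType": "function",  "changeType": "added"},
  {"name": "Options",    "entityType": "interface", "changeType": "added"},
  {"name": "parseArgs",  "entityType": "function",  "changeType": "modified"},
  {"name": "version",    "entityType": "property",  "changeType": "modified"}
]}
"""

changes = json.loads(SEM_JSON)["changes"]

# Q3-style answer: how many entities of each type changed.
type_counts = Counter(c["entityType"] for c in changes)
# Q4-style answer: how many entities were added / modified / deleted.
change_counts = Counter(c["changeType"] for c in changes)

print(dict(type_counts))
print(dict(change_counts))
```

A raw line diff offers no equivalent field to tally, so the model has to infer entity boundaries and types from the text itself.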

Can't distinguish add vs modify — Q1, Q2

Modified functions show + and - hunks, the same as new functions in changed files. On fffb38f, git-based Claude listed 9 "added" functions, of which 4 were actually modified (detectJsonChanges, parseDiffNameStatus, detectAndGetFiles, populateContent). Precision dropped to 55.6% (5 of 9 correct). sem tags each entity with changeType: "added" vs changeType: "modified".
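The precision figure follows directly from set arithmetic. Here is a small sketch of set-based precision, recall, and F1, assuming the F1 scoring for Q1 and Q2 compares the predicted names against the ground-truth names as sets. The four mislabeled function names come from the text above; the five `added_fn_*` names are placeholders, since the correctly identified functions are not listed here.

```python
def precision_recall_f1(predicted: set, truth: set) -> tuple:
    """Return (precision, recall, F1) for a predicted set against a truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Modified functions that the git-based run reported as "added" (from the text).
mislabeled = {
    "detectJsonChanges",
    "parseDiffNameStatus",
    "detectAndGetFiles",
    "populateContent",
}
# Placeholder names for the five correct answers (hypothetical).
correct = {f"added_fn_{i}" for i in range(1, 6)}

predicted = correct | mislabeled
# Assumption: the truth set here holds only the five hits, so the recall from
# this toy setup is not the benchmark's recall. Only precision is meaningful.
precision, _, _ = precision_recall_f1(predicted, correct)

print(f"precision = {precision:.1%}")
```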

Config file blindness — Q2, Q3

JSON/YAML/TOML changes appear as raw +/- key-value lines. The model doesn't classify these as "entities." On fffb38f, git-based Claude missed package.json and package-lock.json as containing modified entities (recall dropped to 66.7%). sem reports entityType: "property" for each changed key.
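One way to picture property-level entities is a recursive key comparison of the parsed config before and after the commit. This is only a sketch of the idea, not sem's implementation; the function name `property_changes`, the `name` field, and the sample package.json contents are assumptions.

```python
import json


def property_changes(before: dict, after: dict, prefix: str = "") -> list:
    """Compare two parsed JSON objects and report each changed key as a property."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        path = f"{prefix}.{key}" if prefix else key
        if key not in before:
            kind = "added"
        elif key not in after:
            kind = "deleted"
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            # Recurse into nested objects so each leaf key is its own entity.
            changes.extend(property_changes(before[key], after[key], path))
            continue
        elif before[key] != after[key]:
            kind = "modified"
        else:
            continue  # unchanged key
        changes.append({"name": path, "entityType": "property", "changeType": kind})
    return changes


# Hypothetical before/after contents of a package.json.
old = json.loads('{"version": "0.1.0", "dependencies": {"chalk": "^5.0.0"}}')
new = json.loads(
    '{"version": "0.2.0", "dependencies": {"chalk": "^5.0.0", "commander": "^12.0.0"}}'
)

config_changes = property_changes(old, new)
for change in config_changes:
    print(change)
```

With output in this shape, a config file shows up in the same entity list as code files, so a question like Q2 no longer depends on the model deciding whether a key-value line counts as an entity.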

Context window pressure — Q1 on large diffs

Both tools degrade on the 3,905-line Rust rewrite (ae576ab). git diff was truncated at 100KB — the model found 25/67 added functions (37% recall). sem's stripped JSON is much more compact (no source code), so the model saw all 278 entities but still only extracted 43/67 (64% recall). sem wins, but large diffs are where attention limits hurt regardless of format.
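The two input-size effects described above, truncating a large diff and stripping source content from the JSON, can be sketched as follows. The content field names (`beforeContent`, `afterContent`), the sample entry, and the reading of "100KB" as 100 × 1024 bytes are all assumptions made for illustration.

```python
import json

MAX_BYTES = 100 * 1024  # assumption: "100KB" read as 100 KiB


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8, dropping any split trailing character."""
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def strip_content(changes: list, content_keys=("beforeContent", "afterContent")) -> list:
    """Return copies of the change records without their source-code fields."""
    return [{k: v for k, v in c.items() if k not in content_keys} for c in changes]


# Hypothetical change record with source code attached.
sample_changes = [
    {
        "name": "parse_diff",
        "entityType": "function",
        "changeType": "added",
        "beforeContent": None,
        "afterContent": "fn parse_diff(input: &str) -> Vec<Change> {\n    todo!()\n}\n",
    }
]

stripped = strip_content(sample_changes)
full_size = len(json.dumps(sample_changes))
stripped_size = len(json.dumps(stripped))

print(f"full: {full_size} bytes, stripped: {stripped_size} bytes")
```

Truncation discards whatever falls past the cutoff, while stripping keeps one record per entity, which is consistent with the model seeing all 278 entities in the sem input.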

3 commits × 4 questions × 2 tools = 24 API calls. Claude Sonnet 4.5, temperature 0. Content fields stripped from sem JSON for fair comparison.