Ran a proper benchmark to see if it actually matters.
Setup: FastAPI codebase (800 Python files), Claude Sonnet 4.6, 7 tasks (bug fixes, features, refactors, code understanding), 3 runs per task per arm, 42 total executions. Both arms run in full isolation with --strict-mcp-config. Results collected via headless claude -p with --output-format stream-json.
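Collecting the per-run metrics is a one-liner per task plus a tiny parser. A minimal sketch of the parsing side, assuming the stream's final event is a `result` record carrying `total_cost_usd` and `duration_ms` (those field names are from my setup; check them against your CLI version):

```python
import json

def parse_metrics(stream_lines):
    """Pull cost and latency out of one `claude -p --output-format stream-json`
    run. Assumes the final event is a `result` record with `total_cost_usd`
    and `duration_ms` fields -- verify against your CLI version."""
    for line in stream_lines:
        event = json.loads(line)
        if event.get("type") == "result":
            return {
                "cost_usd": event["total_cost_usd"],
                "duration_s": event["duration_ms"] / 1000,
            }
    raise ValueError("no result event in transcript")

# Example with a synthetic result event:
sample = ['{"type": "result", "total_cost_usd": 0.33, "duration_ms": 132000}']
print(parse_metrics(sample))  # {'cost_usd': 0.33, 'duration_s': 132.0}
```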
Results:
Cost per task: $0.78 → $0.33 (-58%)
Duration: 170s → 132s (-22%)
Output tokens: 504 → 189 (-63%)
Savings by task type:
Code understanding: -57%
New features: -53%
Refactoring: -48%
Bug fixes: -29%
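The headline numbers are plain percent changes over the per-arm means, rounded half away from zero:

```python
def pct_change(baseline, treatment):
    """Percent change from baseline arm to graph arm."""
    return 100 * (treatment - baseline) / baseline

print(f"cost:     {pct_change(0.78, 0.33):+.1f}%")  # -57.7% -> reported -58%
print(f"duration: {pct_change(170, 132):+.1f}%")    # -22.4% -> reported -22%
print(f"tokens:   {pct_change(504, 189):+.1f}%")    # -62.5% -> reported -63%
```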
The pattern: baseline Claude makes ~15 Read + 4 Grep + 4 Glob calls per task, accumulating context incrementally. With the graph, it averages 2.3 run_pipeline calls that return pre-ranked context in one shot. Less cache creation, fewer round trips, and the agent writes more concise responses because it already has the right context.

Code understanding benefits the most because that's where the agent spends the most tool calls exploring. Bug fixes benefit the least because the scope is usually narrow enough that a few Read calls get you there anyway.
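The tool-call counts come from tallying tool_use events in the same transcripts. A sketch, assuming assistant events wrap an Anthropic-style `message` whose content blocks look like `{"type": "tool_use", "name": ...}` (again, field names from my setup):

```python
import json
from collections import Counter

def count_tool_calls(stream_lines):
    """Tally tool invocations (Read, Grep, Glob, run_pipeline, ...) from a
    stream-json transcript."""
    counts = Counter()
    for line in stream_lines:
        event = json.loads(line)
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use":
                counts[block["name"]] += 1
    return counts

sample = [
    '{"type": "assistant", "message": {"content": ['
    '{"type": "tool_use", "name": "Read"}, {"type": "tool_use", "name": "Grep"}]}}',
    '{"type": "assistant", "message": {"content": [{"type": "tool_use", "name": "Read"}]}}',
]
print(count_tool_calls(sample))  # Counter({'Read': 2, 'Grep': 1})
```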
The graph is built with tree-sitter (functions, classes, types, imports, call references), stored in SQLite, updated incrementally on file save. It also provides cross-session memory: observations are linked to graph nodes and auto-flagged stale when the code changes.
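The shape of that index is simple. The real tool parses with tree-sitter; this sketch substitutes Python's ast module so it runs standalone, but shows the same idea: definitions per file in SQLite, plus a content hash so observations linked to a file get flagged stale when it changes. Table and column names here are illustrative, not the tool's actual schema:

```python
import ast, hashlib, sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (file TEXT, name TEXT, kind TEXT, line INT);
    CREATE TABLE files (file TEXT PRIMARY KEY, sha TEXT);
    CREATE TABLE observations (file TEXT, node TEXT, note TEXT, stale INT DEFAULT 0);
""")

def index_file(path, source):
    sha = hashlib.sha256(source.encode()).hexdigest()
    prev = db.execute("SELECT sha FROM files WHERE file = ?", (path,)).fetchone()
    if prev and prev[0] != sha:
        # Code changed: flag every observation linked to this file as stale.
        db.execute("UPDATE observations SET stale = 1 WHERE file = ?", (path,))
    db.execute("DELETE FROM nodes WHERE file = ?", (path,))
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            db.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)",
                       (path, node.name, type(node).__name__, node.lineno))
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, sha))

index_file("app.py", "def handler(): pass\nclass Router: pass\n")
db.execute("INSERT INTO observations VALUES ('app.py', 'handler', 'hot path', 0)")
index_file("app.py", "def handler(x): return x\n")  # file edited on save
print(db.execute("SELECT stale FROM observations").fetchone())  # (1,)
```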
Single Rust binary, everything local, zero network calls except an optional license check.
https://vexp.dev — free tier, no account needed.
Happy to answer questions about the approach, the benchmark setup, or where it breaks down.
The savings actually increase with larger files, because that's where the baseline wastes the most: Claude reads a 500-line file to use 20 lines of it.
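A toy model of that waste, assuming ~10 tokens per line of Python (my rough assumption, not a measured figure):

```python
TOKENS_PER_LINE = 10  # rough average for Python source; assumption, not measured

def wasted_tokens(file_lines, relevant_lines):
    """Input tokens spent on lines the agent never uses when it reads the
    whole file instead of just the relevant span."""
    return (file_lines - relevant_lines) * TOKENS_PER_LINE

print(wasted_tokens(500, 20))  # 4800 tokens of dead weight on one Read
print(wasted_tokens(100, 20))  # 800 -- the waste grows with file size
```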