We ran Dewey's agentic retrieval endpoint on all 150 questions in FinanceBench, a benchmark of financial Q&A over real SEC filings. To control for model improvements, we also ran Claude Opus 4.6 directly with each PDF loaded into context (no retrieval). Full-context scored 76.0%; agentic retrieval with the same model scored 83.7%. Six PepsiCo 10-Ks exceeded Claude's 1M-token limit, so the questions over those filings couldn't be answered via full-context at all.
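For readers who want to reproduce the comparison, here is a minimal sketch of the harness, under stated assumptions: the Dewey endpoint URL and its response shape are hypothetical, the model ID is a placeholder, and we pass extracted filing text rather than raw PDFs to keep the example self-contained. The Anthropic client call itself is the standard messages API.

```python
# Minimal comparison harness sketch. DEWEY_URL, the "answer" response
# field, and the model ID are assumptions, not Dewey's actual API.
import anthropic
import requests

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"  # placeholder ID for the Opus model under test
DEWEY_URL = "https://api.example.com/dewey/query"  # hypothetical endpoint


def full_context_answer(question: str, filing_text: str) -> str:
    """Baseline: the entire filing goes into the prompt, no retrieval.

    Filings past the model's context limit fail here outright, which is
    what happened with the six oversized PepsiCo 10-Ks.
    """
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"{filing_text}\n\nQuestion: {question}",
        }],
    )
    return msg.content[0].text


def agentic_answer(question: str, doc_id: str) -> str:
    """Agentic retrieval: the model searches the indexed filing itself."""
    resp = requests.post(DEWEY_URL, json={"question": question, "doc": doc_id})
    resp.raise_for_status()
    return resp.json()["answer"]  # assumed response field
```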
The finding that surprised us most: document enrichment (section summaries, table captions) added 3.8 points for Opus and cost 1.6 points for GPT-5.4. Same features, opposite effects. The explanation is in the tool-call distributions: Opus averaged 21 searches per question, GPT-5.4 averaged 9. Enrichment is a navigation aid, and if you're not navigating, it's noise.
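To make "navigation aid" concrete, here is a sketch of what enrichment could look like at index time, assuming summaries and captions are prepended to each chunk's stored text; the `Chunk` fields and `enriched_text` helper are illustrative, not Dewey's actual schema.

```python
# Hedged sketch of chunk enrichment: each search hit carries a header
# that orients the model within the filing. Field names are illustrative.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    section_summary: str  # model-written summary of the enclosing section
    table_caption: str | None = None  # caption if the chunk is a table


def enriched_text(chunk: Chunk) -> str:
    """Text actually indexed and returned for an enriched chunk."""
    header = f"[Section: {chunk.section_summary}]"
    if chunk.table_caption:
        header += f"\n[Table: {chunk.table_caption}]"
    return f"{header}\n{chunk.text}"
```

Under this framing the split behavior follows: a model issuing ~21 searches per question reads these headers over and over to decide where to look next, while a model issuing ~9 mostly just pays their token cost.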