I benchmarked Claude Code and GitHub Copilot on the same model (Haiku 4.5) with and without RAG-powered semantic search across 60 queries on a real codebase.
RAG didn't make search more accurate on Claude Code, but it cut token consumption by 28%. On Copilot, it cut time to resolution by 44% and improved F1 by 19.5%.
The bigger finding: with the model held constant, tool design alone accounts for a 30-percentage-point (pp) recall gap between the two tools. Benchmark code and data are open source.
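For anyone reading these numbers cold, here is a minimal sketch of how per-query recall and F1 are commonly computed for code search, and how a relative change ("by 19.5%") differs from a percentage-point gap ("30pp"). This is not the benchmark's actual scoring code. The function names and file names are hypothetical, and it assumes each query has a known set of relevant files.

```python
def retrieval_scores(retrieved, relevant):
    """Return (precision, recall, f1) for one query.

    Assumption: results and ground truth are compared as sets of file paths.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def percentage_point_gap(a, b):
    """Absolute difference between two rates, in percentage points."""
    return (a - b) * 100


def relative_change_percent(before, after):
    """Change of `after` relative to `before`, in percent."""
    return (after - before) / before * 100


# Hypothetical query: the tool returned four files, three were actually relevant.
precision, recall, f1 = retrieval_scores(
    retrieved=["a.py", "b.py", "c.py", "d.py"],
    relevant=["a.py", "b.py", "e.py"],
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this convention, a recall of 0.80 versus 0.50 is a 30pp gap, while an F1 moving from 0.50 to 0.60 is a 20% relative improvement. The two kinds of figures are not directly comparable.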
mikeayles•1h ago