We show that a simple harness that fixes the 'intent-execution gap' achieves SOTA pass@1 on 21 models (across diverse providers: Claude, GPT, Gemini, Grok, Qwen) on agentic benchmarks (SWE-Pro, -verif, tb2). This is the first time a single open-source harness reproduces or improves on results on popular benchmarks for modern LLMs!
The code is public to try and build on:
Code: https://github.com/strands-labs/benchmark-harnesses
More importantly, we also generated 138k high-quality agent trajectories (SOTA pass@1) and present a detailed study of them:
"Dissecting model behavior through agent trajectories" https://arxiv.org/abs/2606.17454
Models that achieve similar pass@1 behave very differently internally, and we quantify this using several metrics (such as code state-spaces).
gaurav71531•1h ago
More importantly, we also generated 138k high-quality agent trajectories (SOTA pass@1) and present a detailed study of them: "Dissecting model behavior through agent trajectories" https://arxiv.org/abs/2606.17454
Models that achieve similar pass@1 behave very differently internally, and we quantify this using several metrics (such as code state-spaces).