StanAngeloff•1h ago
This is the full URL for the composite average across DeepSWE, Terminal-Bench and SWE-Atlas-QnA. Each model is measured in its own harness.
What is surprising to me is that Claude Code + Fable 5 (max) is on par with Codex + GPT-5.5 (xhigh)... yet Fable burnt through 1M extra tokens.