I ran evals on the 40-task Linear API benchmark, and most frontier models scored surprisingly well:
- Claude Opus 4.5: 95% (38/40)
- GLM 4.6: 87.5% (35/40)
- Claude Sonnet 4.5: 85% (34/40)
- Claude Haiku 4.5: 82.5% (33/40)
- Kimi K2: 82.5% (33/40)
- Grok 4.1 Fast: 80% (32/40)
- GPT 5.1: 77.5% (31/40)
This makes me wonder whether we really need to reinvent the wheel and build special interfaces (MCPs) for agents interacting with services, when they can just use the existing APIs as they are.
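For context, here is a minimal sketch of the kind of raw API call an agent can already make without any MCP layer, hitting Linear's public GraphQL endpoint directly. It assumes a `LINEAR_API_KEY` environment variable; the query fields are illustrative ones from Linear's documented schema:

```python
import os
import requests

# Query Linear's GraphQL API directly, as an agent would when given
# plain API access instead of an MCP server.
resp = requests.post(
    "https://api.linear.app/graphql",
    headers={
        "Authorization": os.environ["LINEAR_API_KEY"],
        "Content-Type": "application/json",
    },
    json={"query": "{ issues(first: 5) { nodes { identifier title } } }"},
)
resp.raise_for_status()

# Print the first few issues returned by the query.
for issue in resp.json()["data"]["issues"]["nodes"]:
    print(issue["identifier"], issue["title"])
```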
Feedback is more than welcome. Thanks!