There may be hope for humanity yet!
Jokes aside, I'm interested in eventually exploring how well the new OpenAI agent mode handles these tasks if the underlying foundation models struggle with this kind of work.
Maybe add a couple of harder APIs (or more complex queries) as well, where current models overwhelmingly fail?
That way, we can still measure models against the current ones in a couple of years.
Also, adding o3 and, for reference, the model(s) used by superglue in this benchmark would be interesting.
adinagoerres•6h ago
tl;dr: LLMs suck at writing code to use APIs.
We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:

- Best general LLM: 68% success rate. That's roughly 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.
- Anthropic’s models are significantly better at building API integrations than other providers' models.
What makes LLMs fail hard:

- Lack of context: LLMs are just not great at understanding what API endpoints exist and what they do, even when you give them documentation (which we did).
- Multi-step workflows: chaining API calls, where one request depends on the output of another (see the sketch after this list).
- Complex API design: APIs like Square, PostHog, and Asana, which force project selection among other things, trip LLMs up.
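To make the multi-step point concrete, here's a minimal TypeScript sketch of the kind of chained workflow that trips models up. The Stripe customer-then-subscription flow and the helper names are illustrative assumptions, not code from our benchmark harness:

    // Hypothetical two-step Stripe workflow: create a customer, then a
    // subscription that needs the customer ID returned by the first call.
    const STRIPE_KEY = process.env.STRIPE_SECRET_KEY ?? "";

    async function stripePost(path: string, body: Record<string, string>) {
      const res = await fetch(`https://api.stripe.com/v1/${path}`, {
        method: "POST",
        headers: {
          Authorization: `Bearer ${STRIPE_KEY}`,
          "Content-Type": "application/x-www-form-urlencoded",
        },
        body: new URLSearchParams(body).toString(),
      });
      if (!res.ok) throw new Error(`Stripe ${path} failed: ${res.status}`);
      return res.json();
    }

    async function createCustomerWithSubscription(email: string, priceId: string) {
      // Step 1: create the customer.
      const customer = await stripePost("customers", { email });

      // Step 2: chain the ID returned by step 1. The ordering and the exact
      // field ("customer", not the email) are what models frequently get wrong.
      const subscription = await stripePost("subscriptions", {
        customer: customer.id,
        "items[0][price]": priceId,
      });

      return { customer, subscription };
    }

Nothing in the individual endpoint docs spells out that ordering, which is exactly where models with weak context about the API tend to stumble.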
We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...
Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/
If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.
Next up: benchmarking MCP.