How well can agents work with APIs they’ve never seen before? We tested 41 APIs across 8 different LLMs to find out.
API execution is great for benchmarking because it tests core qualities and limitations of LLMs: the depth of the data they were trained on, their stateless architecture, their context dependency, and their reasoning.
Today we're releasing v2 of API-Bench: a benchmark that tests how well LLMs can execute against APIs. Here are the results: https://superglue.ai/benchmark_v2
TL;DR: LLMs fail at integrations because they lack ground truth, state, debugging ability, and access to real system context - everything API integrations fundamentally require.
Here’s what we found:
1. LLMs are only as good as the data they're trained on: when docs change, APIs evolve, or systems are niche/long-tail, they use outdated patterns, guess missing pieces, and hallucinate endpoints and parameters.
2. LLMs are stateless, but integrations are stateful: auth handshakes, pagination, retries, and multi-step flows all need memory, but LLMs can't persist intermediate values or reason across steps (see the first sketch after this list).
3. LLMs produce code that “looks right” but fails at runtime: they cannot isolate the failing step or understand real error messages, so they can't fix what's broken or retry with new hypotheses (second sketch below).
4. LLMs can't reliably interpret imperfect API design: where a human can infer the intended behavior, an LLM will hallucinate whatever looks reasonable.
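
To make point 2 concrete, here's a minimal sketch of the state a typical integration has to carry across calls. The endpoint and field names (api.example.com, next_cursor, etc.) are hypothetical stand-ins, not any specific API:

```python
import requests

BASE = "https://api.example.com"  # hypothetical API

# Step 1: auth handshake - the token returned here has to be carried
# into every subsequent request.
token = requests.post(f"{BASE}/oauth/token", data={
    "client_id": "...",
    "client_secret": "...",
    "grant_type": "client_credentials",
}).json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Step 2: cursor pagination - each response feeds state (the cursor)
# into the next request. Drop or guess this value and you silently
# truncate or duplicate results.
items, cursor = [], None
while True:
    params = {"limit": 100, **({"cursor": cursor} if cursor else {})}
    page = requests.get(f"{BASE}/records", headers=headers, params=params).json()
    items.extend(page["data"])
    cursor = page.get("next_cursor")
    if not cursor:
        break
```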
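And for point 3, here's the debugging loop a human runs but an LLM without execution access can't: make the call, read the actual error, form a new hypothesis, retry. Again, the endpoint and error shapes are hypothetical:

```python
import time
import requests

def create_record(payload, retries=3):
    for attempt in range(retries):
        resp = requests.post("https://api.example.com/records", json=payload)
        if resp.ok:
            return resp.json()
        if resp.status_code == 429:
            # Hypothesis: rate limited - honor the server's backoff hint.
            time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
        elif resp.status_code == 422:
            # Hypothesis: schema mismatch - the error body names the bad
            # field; a human reads it and fixes the payload.
            raise ValueError(f"Schema error: {resp.json().get('errors')}")
        else:
            resp.raise_for_status()
    raise RuntimeError("Retries exhausted")
```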
We open-sourced the benchmark so you can test your own APIs or contribute new ones: https://github.com/superglue-ai/superglue/tree/main/eval/llm...
Curious to hear your experience, and of course always happy to share more learnings.