How well can agents work with APIs they’ve never seen before? We tested 41 APIs across 8 different LLMs to find out.
API execution is great for benchmarking because it tests core qualities and limitations of LLMs: the depth of the data they were trained on, their stateless architecture, their context dependency, and their reasoning.
Today we're releasing v2 of API-Bench: a benchmark that tests how well LLMs can execute against APIs. Here are the results: https://superglue.ai/benchmark_v2
TL;DR: LLMs fail at integrations because they lack ground truth, state, debugging ability, and access to real system context - everything API integrations fundamentally require.
Here’s what we found:
1. LLMs are only as good as the data they're trained on: when docs change, APIs evolve, or systems are niche/long-tail, they use outdated patterns, guess missing pieces, and hallucinate endpoints and parameters.
2. LLMs are stateless, but integrations are stateful: auth handshakes, pagination, retries, and multi-step flows all need memory, but LLMs can't persist intermediate values or reason across steps (see the first sketch after this list).
3. LLMs produce code that “looks right” but fails at runtime: they cannot isolate the failing step or understand real error messages, so they can't fix what's broken or retry with new hypotheses (second sketch below).
4. LLMs can't reliably interpret imperfect API design: where a human can infer the intended behavior, an LLM will hallucinate whatever looks reasonable.
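
To make point 2 concrete, here's a minimal sketch of the state a typical integration has to carry across calls. The endpoint and field names (api.example.com, next_cursor, etc.) are hypothetical stand-ins, not any specific API:

```python
import requests

BASE = "https://api.example.com"  # hypothetical API

# Step 1: auth handshake - the token returned here has to be carried
# into every subsequent request.
token = requests.post(f"{BASE}/oauth/token", data={
    "client_id": "...",
    "client_secret": "...",
    "grant_type": "client_credentials",
}).json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Step 2: cursor pagination - each response feeds state (the cursor)
# into the next request. Drop or guess this value and you silently
# truncate or duplicate results.
items, cursor = [], None
while True:
    params = {"limit": 100, **({"cursor": cursor} if cursor else {})}
    page = requests.get(f"{BASE}/records", headers=headers, params=params).json()
    items.extend(page["data"])
    cursor = page.get("next_cursor")
    if not cursor:
        break
```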
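And for point 3, here's the debugging loop a human runs but an LLM without execution access can't: make the call, read the actual error, form a new hypothesis, retry. Again, the endpoint and error shapes are hypothetical:

```python
import time
import requests

def create_record(payload, retries=3):
    for attempt in range(retries):
        resp = requests.post("https://api.example.com/records", json=payload)
        if resp.ok:
            return resp.json()
        if resp.status_code == 429:
            # Hypothesis: rate limited - honor the server's backoff hint.
            time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
        elif resp.status_code == 422:
            # Hypothesis: schema mismatch - the error body names the bad
            # field; a human reads it and fixes the payload.
            raise ValueError(f"Schema error: {resp.json().get('errors')}")
        else:
            resp.raise_for_status()
    raise RuntimeError("Retries exhausted")
```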
We open-sourced the benchmark so you can test your own APIs or contribute new ones: https://github.com/superglue-ai/superglue/tree/main/eval/llm...
Curious to hear your experience, and of course always happy to share more learnings.