There may be hope for humanity yet!
Jokes aside, I'm interested in eventually exploring how well the new OpenAI agent mode handles these tasks if the underlying foundation models struggle with this kind of work.
Maybe add a couple of harder APIs (or more complex queries) as well, where current models overwhelmingly fail?
That way, we can still measure models against the current ones in a couple of years.
Also, adding o3 and, for reference, the model(s) used by superglue in this benchmark would be interesting.
adinagoerres•6h ago
tl;dr: LLMs suck at writing code to use APIs.
We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:

- Best general LLM: 68% success rate. That's roughly 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.
- Anthropic’s models are significantly better at building API integrations than other providers' models.
What makes LLMs fail hard:

- Lack of context: LLMs are just not great at understanding what API endpoints exist and what they do, even when you give them documentation (which we did).
- Multi-step workflows: chaining API calls, where one request depends on the output of another (see the sketch after this list).
- Complex API design: APIs like Square, PostHog, and Asana, which force project selection among other things, trip LLMs up.
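To make the multi-step point concrete, here's a minimal TypeScript sketch of the kind of chained workflow that trips models up. The Stripe customer-then-subscription flow and the helper names are illustrative assumptions, not code from our benchmark harness:

    // Hypothetical two-step Stripe workflow: create a customer, then a
    // subscription that needs the customer ID returned by the first call.
    const STRIPE_KEY = process.env.STRIPE_SECRET_KEY ?? "";

    async function stripePost(path: string, body: Record<string, string>) {
      const res = await fetch(`https://api.stripe.com/v1/${path}`, {
        method: "POST",
        headers: {
          Authorization: `Bearer ${STRIPE_KEY}`,
          "Content-Type": "application/x-www-form-urlencoded",
        },
        body: new URLSearchParams(body).toString(),
      });
      if (!res.ok) throw new Error(`Stripe ${path} failed: ${res.status}`);
      return res.json();
    }

    async function createCustomerWithSubscription(email: string, priceId: string) {
      // Step 1: create the customer.
      const customer = await stripePost("customers", { email });

      // Step 2: chain the ID returned by step 1. The ordering and the exact
      // field ("customer", not the email) are what models frequently get wrong.
      const subscription = await stripePost("subscriptions", {
        customer: customer.id,
        "items[0][price]": priceId,
      });

      return { customer, subscription };
    }

Nothing in the individual endpoint docs spells out that ordering, which is exactly where models with weak context about the API tend to stumble.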
We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...
Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/
If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.
Next up: benchmarking MCP.