Show HN: We let agents use APIs to find out if they can actually...do things?

https://superglue.ai/api-ranking/
8•adinagoerres•6h ago

Comments

adinagoerres•6h ago
Hi HN! Adina here from superglue. Today I’d like to share a new benchmark we’ve just open sourced: an Agent-API Benchmark, in which we test how well LLMs handle APIs.

tl;dr: LLMs suck at writing code to use APIs.

We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:
- Best general LLM: 68% success rate. That's roughly 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.
- Anthropic’s models are significantly better at building API integrations than those of other providers.
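For readers who want to see how figures like these could be tallied, here is a minimal sketch. It is not the benchmark's actual code: the record layout, the `success_rates` helper, and the sample rows are all hypothetical placeholders, and the even split of tests per (API, model) pair is an assumption the post does not state.

```python
from collections import defaultdict

# Hypothetical result records: (api, model, passed).
# Placeholder values for illustration only, NOT data from the benchmark.
results = [
    ("api_a", "model_x", True),
    ("api_a", "model_y", True),
    ("api_b", "model_x", True),
    ("api_b", "model_y", False),
]


def success_rates(records, key_index):
    """Return {key: fraction of passing tests}, grouped by one tuple field."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = record[key_index]
        total[key] += 1
        passed[key] += int(record[2])
    return {key: passed[key] / total[key] for key in total}


by_api = success_rates(results, 0)    # group by API name
by_model = success_rates(results, 1)  # group by model name

# APIs for which every recorded test passed
always_working = [api for api, rate in by_api.items() if rate == 1.0]

# If the 630 tests were spread evenly over 21 APIs and 6 models
# (an assumption), each (API, model) pair would get this many tests:
tests_per_pair = 630 // (21 * 6)
```

Grouping the same records by API and by model is what would separate "which APIs are hard" from "which models are weak".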

What makes LLMs fail hard:
- Lack of context: LLMs are just not great at understanding which API endpoints exist and what they do, even if you give them documentation (which we did).
- Multi-step workflows (chaining API calls).
- Complex API design: APIs like Square, PostHog, and Asana (forcing project selection, among other things, trips LLMs up).
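One way to see why chained calls are hard is the following back-of-the-envelope sketch. It assumes each call in a chain succeeds independently with the same probability, which is a simplification and not something the benchmark measured. Plugging in the 68% figure as a per-call rate is also an assumption, and `chain_success` is a hypothetical helper.

```python
def chain_success(per_call_success: float, steps: int) -> float:
    """Probability that every call in a chain succeeds, assuming
    independent calls with an identical success rate (a simplification)."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_call_success ** steps


# Assumed per-call success rate, borrowed from the headline figure
# purely for illustration.
for steps in (1, 2, 3, 5):
    print(f"{steps} chained call(s): {chain_success(0.68, steps):.3f}")
```

Under these assumptions the chance of a clean run shrinks geometrically with each added step, so even a moderate per-call failure rate makes long workflows unreliable.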

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/

If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.

sfaist•6h ago
The reason we think this would be interesting to share here is that these LLM benchmarks seem increasingly disconnected from reality. idc if the LLM can solve a PhD math question or make scientific discoveries, I care if it can solve our problems, which in our case is automating API integrations. Turns out it mostly can't, which tracks well with our experience using Cursor.

michael-fuest•6h ago
Love hearing Sam Altman talking about feeling the AGI and seeing that million dollar reasoning models can't execute simple API calls despite having a lot of docs and the entire internet as baked-in knowledge.

There may be hope for humanity yet!

Jokes aside, interested in eventually exploring how well the new OpenAI agent mode handles these types of tasks if the underlying foundation models struggle with this type of work.

kyleledbetter•5h ago
Super cool that you all published these benchmarks. We've seen similar, where some APIs work REALLY well with agents, but convoluted ones just produce churn in the agent tool calls. Curious to see how Supabase's APIs would perform with your benchmarks. We've seen their PostgREST via API do really well with our agents (with targeted system prompts), and it's a fairly non-standard REST structure.

nimar•4h ago
v interesting benchmark, looking forward to seeing it evolve over time. actually surprisingly good results already.

maybe add a couple harder APIs (or more complex queries) as well where current models overwhelmingly fail?

that way, we can still measure models in a couple of years against the current ones.

also adding o3 and for reference the model(s) used by superglue in this benchmark would be interesting.
