
Show HN: We let agents use APIs to find out if they can actually...do things?

https://superglue.ai/api-ranking/
11•adinagoerres•6mo ago

Comments

adinagoerres•6mo ago
Hi HN! Adina here from superglue. Today I’d like to share a new benchmark we’ve just open sourced: an Agent-API Benchmark, in which we test how well LLMs handle APIs.

tl;dr: LLMs suck at writing code to use APIs.

We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:
- Best general LLM: 68% success rate. That's 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time, every other API had failures.
- Anthropic’s models are significantly better at building API integrations than other providers.
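For anyone who wants to reproduce this kind of tally on their own runs, here is a minimal sketch of how per-model and per-API success rates could be computed. It is not the benchmark's actual code: the record layout, the model names, and the pass/fail values are made-up placeholders, assuming only that each test run yields one pass/fail outcome.

```python
from collections import defaultdict

# Hypothetical records: (model, api, passed). Names and outcomes are
# illustrative placeholders, not results from the benchmark.
results = [
    ("model-a", "stripe", True),
    ("model-a", "slack", False),
    ("model-a", "github", True),
    ("model-b", "stripe", True),
    ("model-b", "slack", True),
    ("model-b", "github", True),
]


def success_rate_by(records, key_index):
    """Return the fraction of passing runs, grouped by the given field."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = record[key_index]
        total[key] += 1
        passed[key] += int(record[2])
    return {key: passed[key] / total[key] for key in total}


def fully_reliable_apis(records):
    """List the APIs whose runs passed every time, across all models."""
    rates = success_rate_by(records, 1)
    return sorted(api for api, rate in rates.items() if rate == 1.0)


by_model = success_rate_by(results, 0)
by_api = success_rate_by(results, 1)

for model, rate in sorted(by_model.items()):
    print(f"{model}: {rate:.0%}")
print("always passing:", fully_reliable_apis(results))
```

Grouping by API as well as by model is what separates "this model is weak" from "this API is hard for every model".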

What makes LLMs fail hard:
- Lack of context (LLMs are just not great at understanding what API endpoints exist and what they do, even if you give them documentation, which we did)
- Multi-step workflows (chaining API calls)
- Complex API design: APIs like Square, PostHog, Asana (forcing project selection, among other things, trips LLMs up)

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/

If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.

sfaist•6mo ago
The reason we think this would be interesting to share here is that these LLM benchmarks seem increasingly disconnected from reality. idc if the LLM can solve a PhD math question or make scientific discoveries, I care if it can solve our problems, which in our case is automating API integrations. Turns out it mostly can't, which tracks well with our experience using Cursor.

michael-fuest•6mo ago
Love hearing Sam Altman talking about feeling the AGI and seeing that million dollar reasoning models can't execute simple API calls despite having a lot of docs and the entire internet as baked-in knowledge.

There may be hope for humanity yet!

Jokes aside, interested in eventually exploring how well the new OpenAI agent mode handles these types of tasks if the underlying foundation models struggle with this type of work.

kyleledbetter•6mo ago
Super cool that you all published these benchmarks. We've seen similar, where some APIs work REALLY well with agents, but convoluted ones just produce churn in the agent tool calls. Curious to see how Supabase's APIs would perform with your benchmarks. We've seen their PostgREST via API do really well with our agents (with targeted system prompts), and it's a fairly non-standard REST structure.

nimar•6mo ago
v interesting benchmark, looking forward to seeing it evolve over time. actually surprisingly good results already.

maybe add a couple harder APIs (or more complex queries) as well where current models overwhelmingly fail?

that way, we can still measure models in a couple of years against the current ones.

also adding o3 and for reference the model(s) used by superglue in this benchmark would be interesting.