Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
256•isitcontent•19h ago•27 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
355•vecti•21h ago•161 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
329•eljojo•21h ago•199 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
13•sandGorgon•2d ago•3 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
79•phreda4•18h ago•14 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
94•antves•2d ago•70 comments

Show HN: MCP App to play backgammon with your LLM

https://github.com/sam-mfb/backgammon-mcp
3•sam256•3h ago•1 comments

Show HN: XAPIs.dev – Twitter API Alternative at 90% Lower Cost

https://xapis.dev
3•nmfccodes•58m ago•1 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
6•sakanakana00•4h ago•1 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•4h ago•1 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
52•nwparker•1d ago•11 comments

Show HN: BioTradingArena – Benchmark for LLMs to predict biotech stock movements

https://www.biotradingarena.com/hn
26•dchu17•23h ago•12 comments

Show HN: Artifact Keeper – Open-Source Artifactory/Nexus Alternative in Rust

https://github.com/artifact-keeper
152•bsgeraci•1d ago•64 comments

Show HN: ARM64 Android Dev Kit

https://github.com/denuoweb/ARM64-ADK
17•denuoweb•2d ago•2 comments

Show HN: Gigacode – Use OpenCode's UI with Claude Code/Codex/Amp

https://github.com/rivet-dev/sandbox-agent/tree/main/gigacode
19•NathanFlurry•1d ago•9 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
2•melvinzammit•6h ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•6h ago•2 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
10•michaelchicory•8h ago•1 comments

Show HN: Micropolis/SimCity Clone in Emacs Lisp

https://github.com/vkazanov/elcity
173•vkazanov•2d ago•49 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
17•keepamovin•9h ago•5 comments

Show HN: Falcon's Eye (isometric NetHack) running in the browser via WebAssembly

https://rahuljaguste.github.io/Nethack_Falcons_Eye/
6•rahuljaguste•18h ago•1 comments

Show HN: Daily-updated database of malicious browser extensions

https://github.com/toborrm9/malicious_extension_sentry
14•toborrm9•1d ago•8 comments

Show HN: Horizons – OSS agent execution engine

https://github.com/synth-laboratories/Horizons
23•JoshPurtell•1d ago•5 comments

Show HN: Local task classifier and dispatcher on RTX 3080

https://github.com/resilientworkflowsentinel/resilient-workflow-sentinel
25•Shubham_Amb•1d ago•2 comments

Show HN: Fitspire – a simple 5-minute workout app for busy people (iOS)

https://apps.apple.com/us/app/fitspire-5-minute-workout/id6758784938
2•devavinoth12•11h ago•0 comments

Show HN: I built a RAG engine to search Singaporean laws

https://github.com/adityaprasad-sudo/Explore-Singapore
4•ambitious_potat•12h ago•4 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
2•rs545837•13h ago•1 comments

Show HN: A password system with no database, no sync, and nothing to breach

https://bastion-enclave.vercel.app
12•KevinChasse•1d ago•16 comments

Show HN: Craftplan – I built my wife a production management tool for her bakery

https://github.com/puemos/craftplan
568•deofoo•5d ago•166 comments

Show HN: GitClaw – An AI assistant that runs in GitHub Actions

https://github.com/SawyerHood/gitclaw
10•sawyerjhood•1d ago•0 comments

Show HN: LLMs suck at writing integration code… for now

https://github.com/superglue-ai/superglue/tree/main/packages/core/eval/api-ranking
20•sfaist•6mo ago
Hi HN! Stefan here from superglue and today I’d like to share a new benchmark we’ve just open sourced: an Agent-API Benchmark, in which we test how well LLMs handle APIs.

We gave LLMs API documentation and asked them to write code that makes actual API calls. Things like "create a Stripe customer" or "send a Slack message". We're not testing if they can use SDKs; we're testing if they can write raw HTTP requests (with proper auth, headers, body formatting) that actually work when executed against real API endpoints, and if they can extract the relevant information from the response.
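
To make "raw HTTP request" concrete, here is a minimal sketch of the kind of code a model would have to produce for the "create a Stripe customer" task. It is an illustration, not code from the benchmark: the function name and the placeholder key are assumptions, and the request is only built, never sent.

```python
import urllib.parse
import urllib.request


def build_create_customer_request(api_key: str, email: str) -> urllib.request.Request:
    """Build (but do not send) a raw HTTP request that would create a Stripe customer.

    Hypothetical helper for illustration; not part of the benchmark.
    """
    # Stripe's v1 API expects a form-encoded body rather than JSON.
    body = urllib.parse.urlencode({"email": email}).encode("utf-8")
    return urllib.request.Request(
        url="https://api.stripe.com/v1/customers",
        data=body,
        method="POST",
        headers={
            # The secret key is passed as a bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a real credential.
    req = build_create_customer_request("YOUR_API_KEY", "jane@example.com")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Getting any one of these details wrong (JSON instead of a form-encoded body, a missing auth header, the wrong path) is enough for the call to fail against the real endpoint.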

tl;dr: LLMs suck at writing code to use APIs.

We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:

- Best general LLM: 68% success rate. That's 1 in 3 API calls failing, which most would agree isn’t viable in production

- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.

- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.

- Anthropic’s models are significantly better at building API integrations than other providers.

Here is the results chart: https://superglue.ai/files/performance.png

What made LLMs fail:

- Lack of context (LLMs are just not great at understanding what API endpoints exist and what they do, even if you give them documentation, which we did)

- Multi-step workflows (chaining API calls)

- Complex API design: APIs like Square, PostHog, Asana (forcing project selection, among other things, trips LLMs up)
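
As an illustration of the multi-step case, here is a minimal sketch of a two-call workflow in which the second request depends on an ID extracted from the first. The paths, payload shapes, and the in-memory `fake_transport` are all hypothetical stand-ins so the example runs offline; they are not the real endpoints of Asana or any other API in the benchmark.

```python
from typing import Callable

# Hypothetical transport signature: (method, path, payload) -> decoded JSON response.
Transport = Callable[[str, str, dict], dict]


def create_task_in_project(call: Transport, project_name: str, task_name: str) -> dict:
    """Two-step workflow: look up a project ID, then create a task inside it."""
    # Step 1: list projects and select the one we need.
    projects = call("GET", "/projects", {})["data"]
    matches = [p for p in projects if p["name"] == project_name]
    if not matches:
        raise LookupError(f"no project named {project_name!r}")
    project_id = matches[0]["id"]

    # Step 2: this request only works with the ID extracted in step 1.
    return call("POST", "/tasks", {"name": task_name, "project": project_id})["data"]


def fake_transport(method: str, path: str, payload: dict) -> dict:
    """In-memory stand-in for a real API (hypothetical paths and responses)."""
    if (method, path) == ("GET", "/projects"):
        return {"data": [{"id": "p1", "name": "Website"}, {"id": "p2", "name": "Mobile"}]}
    if (method, path) == ("POST", "/tasks"):
        return {"data": {"id": "t1", **payload}}
    raise ValueError(f"unexpected call: {method} {path}")


if __name__ == "__main__":
    print(create_task_in_project(fake_transport, "Mobile", "Fix login"))
```

A mistake in the first step (wrong endpoint, wrong field name for the ID) makes the second step fail too, which is one plausible reason chained calls are harder than single requests.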

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/.

If you're building agents that need reliable API access, we'd love to hear your approach, or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.

Comments

adinagoerres•6mo ago
Hey HN, I'm Adina, Stefan's co-founder at superglue. When we started working on LLM-powered integrations about a year ago, the models were barely good enough to handle simple mappings. We started benchmarking our performance as an internal evals project and thought it would be fun to open source it, to create more transparency around LLM performance. Our goal here is to understand how we can make agents production-ready and improve reliability across the board.
hoerzu•6mo ago
Love the benchmarks. Is it better to use a single LLM for performance, or would you always advise adding a self-reflection step?
adinagoerres•6mo ago
self-reflection is very important for both humans and LLMs, indeed
hoerzu•6mo ago
What's the hello world of superglue?
ForzaAaRon•6mo ago
Fascinating read. Interesting how Opus performs worse than Sonnet.
sfaist•6mo ago
Quite interesting, actually. Not sure why; I assume it just overthinks. What surprised me even more is how badly o4-mini performed, after taking up hours of evaluation time and more credits than all other LLMs combined. More thinking != better (integration) coding performance.
iamflimflam1•6mo ago
I would expect most developers to fail at this challenge. Here’s the doc - you’ve got one chance to get the API to do this.

I can’t tell from the description if the LLMs are allowed to try and then correct based on any errors received.

Though it would be surprising if that helped. Most APIs don’t tell you what you’ve done wrong…

sfaist•6mo ago
We would've assumed that LLMs are much better at writing working code here, since these aren't random APIs but established API patterns that they should be able to one-shot (e.g. Stripe). Bad error messages are a problem indeed. We will release another benchmark with retries very soon.
danmeier•6mo ago
very interesting! curious to see the benchmarks for MCP!
ThomasMin•6mo ago
Awesome work Stefan, this is super insightful! Really appreciate the transparency and open-sourcing the benchmark. The 68% success rate is a wake-up call for anyone building with LLMs. Your 91% integration layer result is impressive, shows tooling matters. Excited to see what you uncover next with MCP!
maxprokopp•6mo ago
Exciting benchmarks, great work Adina and Stefan!
hande-k•6mo ago
Really appreciate you sharing this. What I am trying to use is gpt o3, so would be curious to see it in the benchmarks. Still seeing the raw traces tells me the tooling is starting to cross the “actually usable” line and makes me want to try on my examples this weekend. Looking forward to the MCP benchmark as well.
mutant•6mo ago
Thanks for the self host option. I tried the slack example and was very impressed with results, thank you!