Show HN: A new benchmark for testing LLMs for deterministic outputs

https://interfaze.ai/blog/introducing-structured-output-benchmark
23•khurdula•2h ago
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases like converting an invoice into rows or meeting transcripts into tickets or even complex PDFs into database entries.

The model may return the schema you want, but with hallucinated values like `invoice_date` being off by 2 months or the transcript array ordered wrongly. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, and not the actual values within the produced JSON.

So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio.

For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and an LLM cross-check, so a missing or hallucinated value will be considered to be wrong.
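
To illustrate the difference between the two kinds of checks, here is a minimal sketch, not the actual SOB scoring code. The simplified `SCHEMA` mapping, the helper names `schema_pass` and `value_accuracy`, and the invoice record are all made up for illustration; the benchmark itself uses real JSON Schemas.

```python
import json

# Hypothetical, simplified schema: field name -> expected Python type.
# (A stand-in for a real JSON Schema, to keep the sketch dependency-free.)
SCHEMA = {"invoice_number": str, "invoice_date": str, "total": float}


def schema_pass(output: dict, schema: dict) -> bool:
    """True when the output has exactly the schema's fields with the right types."""
    return output.keys() == schema.keys() and all(
        isinstance(output[key], expected) for key, expected in schema.items()
    )


def value_accuracy(output: dict, truth: dict) -> float:
    """Fraction of ground-truth fields whose value matches exactly."""
    correct = sum(1 for key, value in truth.items() if output.get(key) == value)
    return correct / len(truth)


# Made-up record: the model's date is off by two months.
truth = {"invoice_number": "INV-001", "invoice_date": "2024-03-15", "total": 120.5}
model_output = json.loads(
    '{"invoice_number": "INV-001", "invoice_date": "2024-05-15", "total": 120.5}'
)

passed = schema_pass(model_output, SCHEMA)
accuracy = value_accuracy(model_output, truth)
print(passed, round(accuracy, 2))  # True 0.67
```

The output passes the schema check, yet only two of the three values are right, which is the gap a schema-only benchmark cannot see.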

Open source is doing pretty well, with GLM-4.7 coming in at number 2 right after GPT-5.4.

We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.

For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
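
A field-level check of the kind described above could look roughly like this. It is a sketch under assumptions: `flatten` and `field_mismatches` are hypothetical helper names, the `channels` field is invented to show list handling, and exact equality is used where a real benchmark might normalize values first.

```python
def flatten(value, path=""):
    """Yield (path, leaf) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        # List positions are part of the path, so reordering counts as a mismatch.
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, value


_MISSING = object()  # sentinel, so a missing field differs from an explicit null


def field_mismatches(output, truth):
    """Return (path, expected, actual) for every ground-truth leaf that differs."""
    out_leaves = dict(flatten(output))
    mismatches = []
    for path, expected in flatten(truth):
        actual = out_leaves.get(path, _MISSING)
        if actual is _MISSING or actual != expected:
            mismatches.append((path, expected, None if actual is _MISSING else actual))
    return mismatches


# The age range is the example from the post; "channels" is hypothetical.
truth = {"target_market_age": "15 to 35 years", "channels": ["online", "retail"]}
output = {"target_market_age": "25 to 35", "channels": ["online", "retail"]}

mismatches = field_mismatches(output, truth)
print(mismatches)  # [('target_market_age', '15 to 35 years', '25 to 35')]
```

Both records have the same keys and types, so only the per-field comparison surfaces the wrong age range.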

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.

Comments

stared•1h ago
Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within certain time or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.

Flux159•51m ago
Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.

Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.

khurdula•9m ago
Yeah, we selected models that are most commonly integrated in developer workflows and used for structured output. Typically those models tend to be in the low-to-mid cost range, with no or low reasoning.

For the benchmark, this was kept consistent across all models, and typically Opus and 3.1 Pro would be overkill and expensive even with reasoning off.

Good point tho, will add this point in the blog too :)

Also the benchmark is open source, so anyone can run a model on it and create a PR too, the leaderboard is dynamic and will automatically add that in.

zihotki•31m ago
I wonder if this benchmark brings any value. Models are already quite capable and reach high scores in it.

khurdula•5m ago
Check out the "The JSON-pass vs Value-Accuracy gap" section in the blog. That was an eye opener.

While most models were great at producing JSON schema, they were pretty bad at producing accurate values.

In the graph you'll see there is almost a 20%-30% drop between the JSON schema pass rate and the value accuracy.

dalberto•26m ago
A benchmark without Opus 4.6/4.7 feels incomplete.

iLoveOncall•22m ago
This is just a hallucinations benchmark on a subset of outputs, not sure there's a value over general hallucinations benchmarks?

> Our goal is to be the best general model for deterministic tasks

I'm sorry but this simply doesn't make sense. If you want a deterministic output don't use an LLM.

broyojo•13m ago
hmm why can't structured decoding be used?
