Show HN: A new benchmark for testing LLMs for deterministic outputs

https://interfaze.ai/blog/introducing-structured-output-benchmark
20•khurdula•2h ago
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases such as converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries.

The model may return the schema you want but with hallucinated values, such as an `invoice_date` that is off by two months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, and not the actual values within the produced JSON.
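To make that gap concrete, here is a minimal sketch of a schema-and-type check that passes even though a value is wrong. The `matches_schema` helper, the tiny schema, and the invoice values are all hypothetical illustrations; this is not how JSONSchemaBench or SOB is implemented, and it covers only a small subset of JSON Schema.

```python
# Illustrative only: a type check for a small subset of JSON Schema.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def matches_schema(value, schema):
    """Return True if value has the shape and types the schema describes."""
    kind = schema["type"]
    if kind == "object":
        if not isinstance(value, dict):
            return False
        if any(key not in value for key in schema.get("required", [])):
            return False
        return all(
            matches_schema(value[key], sub)
            for key, sub in schema.get("properties", {}).items()
            if key in value
        )
    if kind == "array":
        return isinstance(value, list) and all(
            matches_schema(item, schema["items"]) for item in value
        )
    # bool is a subclass of int in Python, so reject it for numeric types.
    if kind in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[kind])


# Hypothetical schema and records, invented for this example.
schema = {
    "type": "object",
    "required": ["invoice_date", "total"],
    "properties": {
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
    },
}
ground_truth = {"invoice_date": "2024-03-15", "total": 120.5}
model_output = {"invoice_date": "2024-05-15", "total": 120.5}  # date off by two months

schema_ok = matches_schema(model_output, schema)  # structure and types pass
values_ok = model_output == ground_truth          # the values do not match
print(schema_ok, values_ok)
```

A pass-rate metric built only on the first check would count this record as a success.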

So we designed the Structured Output Benchmark (SOB) to fix this by measuring the JSON schema pass rate, types, and value accuracy across all three modalities: text, image, and audio.

For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and cross-checked by an LLM, so a missing or hallucinated value is counted as wrong.

Open source is doing pretty well, with GLM-4.7 coming in at number 2 right after GPT-5.4.

We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.

For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
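A field-level check of the kind described here can be sketched as follows. This is an assumption-laden illustration, not the SOB scoring code: `flatten` and `field_report` are hypothetical helpers, matching is exact equality, and the `product` field is invented to round out the record. Only the `target_market_age` values come from the example above.

```python
_MISSING = object()  # sentinel so a legitimate None value is not mistaken for "absent"


def flatten(value, prefix=""):
    """Map every leaf value to a path such as 'items[0].name'."""
    if isinstance(value, dict):
        leaves = {}
        for key, sub in value.items():
            path = f"{prefix}.{key}" if prefix else key
            leaves.update(flatten(sub, path))
        return leaves
    if isinstance(value, list):
        leaves = {}
        for index, sub in enumerate(value):
            leaves.update(flatten(sub, f"{prefix}[{index}]"))
        return leaves
    return {prefix: value}


def field_report(expected, actual):
    """Compare leaf fields of the ground truth against the model output.

    Returns (accuracy, mismatches), where mismatches maps each failing path
    to an (expected, actual) pair. Fields that appear only in the model
    output are ignored in this sketch.
    """
    want = flatten(expected)
    got = flatten(actual)
    mismatches = {}
    for path, value in want.items():
        found = got.get(path, _MISSING)
        if found is _MISSING:
            mismatches[path] = (value, "<missing>")
        elif found != value:
            mismatches[path] = (value, found)
    accuracy = 1 - len(mismatches) / len(want) if want else 1.0
    return accuracy, mismatches


ground_truth = {"target_market_age": "15 to 35 years", "product": "energy drink"}
model_output = {"target_market_age": "25 to 35", "product": "energy drink"}

accuracy, mismatches = field_report(ground_truth, model_output)
print(accuracy)    # 0.5
print(mismatches)  # {'target_market_age': ('15 to 35 years', '25 to 35')}
```

Exact equality is the strictest choice; a real scorer may need normalization (dates, units, whitespace) before comparing, which this sketch leaves out.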

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.

Comments

stared•46m ago
Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within certain time or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it is missing key frontier models, I am (at best) confused.

Flux159•32m ago
Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.

Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.

zihotki•11m ago
I wonder if this benchmark brings any value. Models are already quite capable and reach high scores in it.

dalberto•6m ago
A benchmark without Opus 4.6/4.7 feels incomplete.
