Show HN: A new benchmark for testing LLMs for deterministic outputs

https://interfaze.ai/blog/introducing-structured-output-benchmark

12•khurdula•1h ago

When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases: converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries.

The model may return the schema you want, but with hallucinated values, such as an `invoice_date` that is off by 2 months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, and not the actual values within the produced JSON.

So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio.

For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context, both manually by a human and with an LLM cross-check, so a missing or hallucinated value counts as wrong.
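To make the idea of value accuracy concrete, here is a minimal sketch of a field-level scorer. It is not the SOB scoring code: the `flatten` and `value_accuracy` helpers, the exact-match rule, and the sample invoice record are all assumptions made for illustration.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
    else:
        items[prefix] = obj
    return items


def value_accuracy(predicted, truth):
    """Share of leaf fields whose predicted value exactly matches the truth.

    Assumption: a field missing from the prediction, or present only in the
    prediction, is scored as wrong (the denominator is the union of paths).
    """
    truth_flat = flatten(truth)
    pred_flat = flatten(predicted)
    all_paths = set(truth_flat) | set(pred_flat)
    if not all_paths:
        return 1.0
    correct = sum(
        1
        for path in all_paths
        if path in truth_flat
        and path in pred_flat
        and truth_flat[path] == pred_flat[path]
    )
    return correct / len(all_paths)


# Hypothetical record: the date is off by 2 months and the items are reordered.
truth = {
    "invoice_date": "2024-03-01",
    "total": 120.5,
    "items": [{"sku": "A"}, {"sku": "B"}],
}
predicted = {
    "invoice_date": "2024-05-01",
    "total": 120.5,
    "items": [{"sku": "B"}, {"sku": "A"}],
}

score = value_accuracy(predicted, truth)
print(score)  # 0.25: only "total" matches
```

Both objects would pass the same schema, yet only one of four leaf values is correct. Exact matching is the strictest possible rule; a real scorer might normalize strings or compare arrays differently.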

Open source is doing pretty well, with GLM-4.7 coming in at number 2, right after GPT-5.4.

We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.

For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
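The audio example above can be sketched in a few lines. This is a hedged illustration, not the benchmark's checker: `type_check` stands in for a schema-style guardrail, `field_mismatches` for a field-level comparison, and both names are hypothetical.

```python
def type_check(record, expected_types):
    """Schema-style guardrail: every field is present with the expected type."""
    return all(
        isinstance(record.get(name), expected)
        for name, expected in expected_types.items()
    )


def field_mismatches(record, truth):
    """Field-level check: map each wrong field to (returned, expected)."""
    return {
        name: (record.get(name), expected)
        for name, expected in truth.items()
        if record.get(name) != expected
    }


expected_types = {"target_market_age": str}
truth = {"target_market_age": "15 to 35 years"}
model_output = {"target_market_age": "25 to 35"}

passes_guardrail = type_check(model_output, expected_types)
mismatches = field_mismatches(model_output, truth)

print(passes_guardrail)  # True: the value is a string, so the guardrail passes
print(mismatches)        # the wrong field, with returned and expected values
```

The type check passes because "25 to 35" is a perfectly plausible string. Only the comparison against ground truth exposes the error.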

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves up against the best.

Comments

stared•5m ago

Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within a certain time or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.
