Show HN: A new benchmark for testing LLMs for deterministic outputs

https://interfaze.ai/blog/introducing-structured-output-benchmark
12•khurdula•1h ago
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases, such as converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries.

The model may return output that matches the schema you want, but with hallucinated values, such as an `invoice_date` that is off by 2 months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, and not the actual values within the produced JSON.

So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio.

For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context both manually by a human and by an LLM cross-check, so a missing or hallucinated value is counted as wrong.
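
To make the gap between a schema pass and value accuracy concrete, here is a minimal Python sketch. It is not the SOB scoring code: the `flatten` and `value_accuracy` helpers, the exact-match rule, and the sample invoice are all assumptions made up for this illustration.

```python
import json

def flatten(value, prefix=""):
    """Map every leaf of a nested JSON value to a path such as 'items[0]'."""
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(value, list):
        out = {}
        for i, child in enumerate(value):
            out.update(flatten(child, f"{prefix}[{i}]"))
        return out
    return {prefix: value}

def value_accuracy(predicted, truth):
    """Fraction of ground-truth leaves that the prediction reproduces.

    Hypothetical metric: a leaf counts only if the same path exists in the
    prediction and compares equal. Missing leaves count as wrong.
    """
    truth_leaves = flatten(truth)
    pred_leaves = flatten(predicted)
    if not truth_leaves:
        return 1.0
    hits = sum(
        1
        for path, expected in truth_leaves.items()
        if path in pred_leaves and pred_leaves[path] == expected
    )
    return hits / len(truth_leaves)

# Made-up record: the date is two months off and the array order is swapped.
truth = {"invoice_date": "2026-01-15", "total": 120.5, "items": ["A", "B"]}
raw = '{"invoice_date": "2026-03-15", "total": 120.5, "items": ["B", "A"]}'

predicted = json.loads(raw)  # parses fine, and every field has the right type
score = value_accuracy(predicted, truth)
print(score)
```

In this toy record the output is valid JSON with the right keys and types, yet only one of the four ground-truth leaves (`total`) matches, so the score is 0.25. A real scorer would likely need looser matching for some fields, which this sketch does not attempt.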

Open source is doing pretty well, with GLM-4.7 coming in at number 2, right after GPT-5.4.

We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.

For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
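
A field-level check of the kind mentioned above could look roughly like the following Python sketch. The `field_mismatches` helper and the `product` field are hypothetical and only serve the example; the `target_market_age` values are the ones quoted above.

```python
_MISSING = object()  # sentinel for a field the model did not return

def field_mismatches(predicted, truth, path=""):
    """Yield (path, expected, got) for every ground-truth leaf that differs."""
    if isinstance(truth, dict):
        for key, expected in truth.items():
            got = predicted.get(key, _MISSING) if isinstance(predicted, dict) else _MISSING
            yield from field_mismatches(got, expected, f"{path}.{key}" if path else key)
    elif isinstance(truth, list):
        for i, expected in enumerate(truth):
            in_range = isinstance(predicted, list) and i < len(predicted)
            got = predicted[i] if in_range else _MISSING
            yield from field_mismatches(got, expected, f"{path}[{i}]")
    elif predicted is _MISSING:
        yield (path, truth, "<missing>")
    elif predicted != truth:
        yield (path, truth, predicted)

truth = {"target_market_age": "15 to 35 years", "product": "sneakers"}
predicted = {"target_market_age": "25 to 35", "product": "sneakers"}

mismatches = list(field_mismatches(predicted, truth))
for path, expected, got in mismatches:
    print(f"{path}: expected {expected!r}, got {got!r}")
```

Both objects have the same keys and string types, so a schema validator would accept either one. Only the per-field comparison reports `target_market_age` as wrong.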

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.

Comments

stared•5m ago
Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within a certain time or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it is missing key frontier models, I am (at best) confused.