
Structured Outputs Create False Confidence

https://boundaryml.com/blog/structured-outputs-create-false-confidence
41•gmays•1h ago

Comments

dzrmb•1h ago
Interesting read and perspective. I've had very good results with structured outputs, across text, images, and tool calling. A lot of SDKs rely on it too, including the Vercel AI SDK.

Thanks for sharing

swe_dima•1h ago
OpenAI structured outputs are pretty stable for me. Gemini sometimes responds with a completely different structure, and Gemini 3 Flash with grounding sometimes returns JSON wrapped in ```json...``` fences, causing parsing errors.
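A minimal pre-parse strip handles the fenced case before reaching for a repair library (a sketch; `parse_fenced_json` is a hypothetical helper, not a library function):

```python
import json
import re

def parse_fenced_json(text: str) -> dict:
    """Strip optional ```json ... ``` fences before parsing."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)
```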
codegladiator•1h ago
https://github.com/josdejong/jsonrepair

Might be useful (I am not the author).

euazOn•38m ago
In case you're using OpenRouter, check out their new Response Healing feature that claims to solve exactly this issue.

https://openrouter.ai/announcements/response-healing-reduce-...

cmews•1h ago
Structured outputs work well depending on the task. The example in the blog post doesn't show much on its own, because the prompt and schema definition are missing. The quantity is also genuinely ambiguous: "bananas" appears only once as a line item on the receipt, so reading the quantity as 1 is defensible.

I would love more detailed, reproducible examples, because the claims don't hold for the use cases I've had.

mikert89•1h ago
Please, I can't take any more anti-AI hot takes.
swiftcoder•55m ago
How is this an "anti AI hot take"? It's discussing using one type of LLM output versus another...
emp17344•5m ago
Sounds like a you problem. I’m all for people investigating the boundaries of model capability - if you take that as a personal attack, you’re going to have a bad time over the next few years.
NitpickLawyer•59m ago
A third alternative is to use the best of both worlds: have the model respond free-form, then use that response plus a structured-output API call to ask it for JSON. More expensive, but better overall results. (And you can cross-check your heuristic parsing against the structured output, and retry or alert on mismatches.)
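A sketch of that two-pass flow with the mismatch check (the `ask_*` callables are stand-ins for your actual model calls, and the string-containment heuristic is an assumption):

```python
import json

def two_pass_extract(ask_freeform, ask_structured, prompt):
    """Pass 1: free-form answer. Pass 2: structured JSON built from it.
    Cross-check: flag extracted values that never appear in the free-form text."""
    freeform = ask_freeform(prompt)
    structured = ask_structured("Convert this answer to JSON:\n" + freeform)
    data = json.loads(structured)
    mismatches = [k for k, v in data.items() if str(v) not in freeform]
    return data, mismatches
```

A non-empty `mismatches` list is the retry/alert signal.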
machinationu•57m ago
Or tell it to output the data at the end as markdown, then do a second pass with a cheaper model to build the structured output.

Also, XML works much better than JSON; the model providers' prompting guides say as much.

dcastm•47m ago
While I agree that you must be careful when using structured outputs, the article doesn't provide good arguments:

1. In the examples provided, the author compares freeform CoT + JSON output vs. non-CoT structured output. This is unfair and biases the results towards what they wanted to show. These days, you don't need to include a "reasoning" field in the schema as mentioned in the article; you can just use thinking tokens (e.g., reasoning_effort for OpenAI models). You get the best of both worlds: freeform reasoning and structured output. I tested this, and the results were very similar for both.

2. Let Me Speak Freely? had several methodological issues. I address some of them (and .txt's rebuttal) here: https://dylancastillo.co/posts/say-what-you-mean-sometimes.h...

3. There's no silver bullet. Structured outputs might improve or worsen your results depending on the use case. What you really need to do is run your evals and make a decision based on the data.
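Running that comparison can be as small as a shared case list (a sketch; the extractor callables stand in for whichever two variants you're evaluating):

```python
def accuracy(extract, cases):
    """cases: list of (input_text, expected_dict); returns fraction exactly correct."""
    return sum(extract(text) == want for text, want in cases) / len(cases)

# Usage: run both variants over the same labeled receipts and compare.
# structured_acc = accuracy(structured_extract, cases)
# freeform_acc = accuracy(freeform_then_parse, cases)
```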

pizzathyme•43m ago
The very first example, which is held up as an error, is actually arguably correct. If you asked a human (me) how many bananas were purchased, they clearly purchased one banana.

Yes the banana weighs 0.4 pounds. But the question was not to return the weight or the quantity, the question was to return the quantity.

It seems the prompt needs more instructions than the author realized.

simonw•34m ago
I'm not 100% convinced by this post. I'd like to see a more extensive formal eval that demonstrates that structured outputs from different providers reduces the quality of data extraction results.

Assuming this holds up, I wonder if a good workaround for this problem - the problem that turning on structured outputs makes errors more likely - would be to do this:

1. Prompt the LLM "extract numbers from this receipt, return data in this JSON format: ..." - without using the structured output mechanism.

2. If the returned JSON does indeed fit the schema then great, you're finished! But if it doesn't...

3. Round-trip the response from the previous call through the LLM again, this time with structured outputs configured. This should give you back the higher quality extracted data in the exact format you want.
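The three steps above can be sketched as follows (the two `call_*` callables and `fits_schema` are stand-ins, not a real SDK):

```python
import json

def extract_with_fallback(call_llm, call_llm_structured, prompt, fits_schema):
    # Step 1: ask for JSON via the prompt alone, without constrained decoding.
    raw = call_llm(prompt)
    # Step 2: if the reply already parses and matches the schema, done.
    try:
        data = json.loads(raw)
        if fits_schema(data):
            return data
    except json.JSONDecodeError:
        pass
    # Step 3: round-trip the free-form reply through constrained decoding.
    return json.loads(call_llm_structured("Reformat as JSON:\n" + raw))
```

The constrained call only pays its cost on the minority of responses that fail validation.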

A_SIGINT•31m ago
> Chain-of-thought is crippled by structured outputs

I don't know if this is true. Libraries such as Pydantic AI (and, I'd assume, the model providers' SDKs) stream different event types: if CoT is needed, a <think> section is emitted first, and the structured response only begins once the model starts its final answer.

Structured outputs can be quite reliable if used correctly. For example, I designed an AST structure that allows me to reliably generate SQL. The model has tools to inspect data-points, view their value distributions (quartiles, medians, etc). Then once I get the AST structure back I can perform semantic validation easily (just walk the tree like a compiler). Once semantic validation passes (or forces a re-prompt with the error), I can just walk the tree again to generate SQL. This helps me reliably generate SQL where I know it won't fail during execution, and have a lot of control over what data-points are used together, and ensuring valid values are used for them.

I think the trick is just generating the right schema to model your problem, and understanding the depth of an answer that might come back.
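A toy version of that validate-then-emit walk (hypothetical mini-AST and column catalog; a real schema would be much richer):

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_COLUMNS = {"price", "quantity"}  # assumed catalog of valid data-points

@dataclass
class Comparison:
    column: str
    op: str

@dataclass
class Select:
    columns: list
    table: str
    where: Optional[Comparison] = None

def validate(node: Select) -> list:
    """Walk the tree like a compiler and collect unknown column references."""
    errors = [c for c in node.columns if c not in KNOWN_COLUMNS]
    if node.where and node.where.column not in KNOWN_COLUMNS:
        errors.append(node.where.column)
    return errors  # non-empty -> re-prompt the model with the errors

def to_sql(node: Select) -> str:
    """Second walk: emit SQL only after semantic validation passed."""
    sql = f"SELECT {', '.join(node.columns)} FROM {node.table}"
    if node.where:
        sql += f" WHERE {node.where.column} {node.where.op} %s"
    return sql
```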

noreplydev•30m ago
I don’t know if 0,42 should be the quantity
supermdguy•22m ago
If your output schema doesn't capture all correct outputs, that's a problem with your schema, not the LLM. A human using a data-entry tool would run into the same issue. Letting the LLM output whatever it wants just means you have to deal with the ambiguities manually, instead of teaching the LLM what to do.

I usually start by adding an error type that will be overused by the LLM, and use that to gain visibility into the types of ambiguities that come up in real-world data. Then over time you can build a more correct schema and better prompts that help the LLM deal with ambiguities the way you want it to.
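One stdlib sketch of that error-type escape hatch as a tagged union (the field names here are my assumption, not a fixed convention):

```python
import json

def parse_extraction(payload: str):
    """Tagged union: the schema lets the model return either a value or a
    typed 'ambiguous' error explaining what it couldn't decide."""
    data = json.loads(payload)
    if data["kind"] == "ambiguous":
        return ("error", data["reason"])  # log these to refine schema and prompts
    return ("ok", data["quantity"])
```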

Also, a lot of the chain of thought issues are solved by using a reasoning model (which allows chain of thought that isn’t included in the output) or by using an agentic loop with a tool call to return output.

dhruvbird•8m ago
This ^^^^

While the provided schema has a "quantity" field, it doesn't mention the units.

```python
from pydantic import BaseModel, Field

class Item(BaseModel):
    name: str
    price: float = Field(description="per-unit item price")
    quantity: float = Field(default=1, description="If not specified, assume 1")

class Receipt(BaseModel):
    establishment_name: str
    date: str = Field(description="YYYY-MM-DD")
    total: float = Field(description="The total amount of the receipt")
    currency: str = Field(description="The currency used for everything on the receipt")
    items: list[Item] = Field(description="The items on the receipt")
```

The evaluation needs a better schema, one that captures the full details of what is expected to be extracted.

> What kind of error should it return if there's no total listed on the receipt? Should it even return an error or is it OK for it to return total = null?

Additionally, the schema allows optional fields, so the LLM is free to skip missing fields if they are specified as such.

Aurornis•22m ago
Does anyone have more benchmarks or evals with data on this topic? The claimed 20% accuracy reduction is significant.

Structured output was one of the lesser known topics that AI consultants and course writers got a lot of mileage out of because it felt like magic. A lot of management people would use ChatGPT but didn’t know how to bridge the text output into a familiar API format, so using a trick to turn it into JSON felt like the missing link. Now that I think about it, I don’t recall seeing any content actually evaluating the impact of constrained output on quality though.

This blog post blurs the lines between output quality reduction and incorrect error handling, though. I’d like to see some more thorough benchmarking that doesn’t try to include obvious schema issues in the quality reduction measurements.

rybosome•2m ago
I have heard this argument before, but never actually seen concrete evals.

The argument goes that because we are intentionally constraining the model (I believe OpenAI's method is to sort tokens by softmax probability, then take the first one that aligns with the current state machine; I'm rusty on my ML math) we get less creativity.

Maybe, but a one-off vibes example is hardly proof. I still use structured output regularly.

Oh, and tool calling is almost certainly implemented atop structured output. After all, it forces the model to respond with JSON matching the tool-arguments schema. I struggle to believe that this is adequate for tool calling but inadequate for general-purpose use.
