As to why providers don't give you a nice API for this: maybe it's hard to implement efficiently.
It's not too bad if inference happens token by token, reverting to the CPU every time, but high-performance LLM inference uses speculative decoding: a smaller model guesses multiple tokens in advance and the main model verifies them in one pass. Enforcing grammar constraints across multiple speculated tokens is tougher; there's an exponential number of states that need precomputing.
So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without stalling the pipeline by going back to the CPU.
And then you start thinking about how big that automaton is going to be: how many states, how deep the pushdown stack. You're basically taking code from the API call and running it on your hardware. There are dragons here, around fair use, denial of service, etc.
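A toy sketch of the on-device idea, assuming the grammar compiles to a plain DFA small enough for a dense transition table (real grammars like JSON need a pushdown stack, so this understates the problem; all names and sizes here are made up):

    import torch

    # Hypothetical sizes; real vocabularies run ~32k-128k tokens.
    NUM_STATES, VOCAB, DEAD = 1024, 32000, 0
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Dense DFA transition table kept on the accelerator; row DEAD is an
    # absorbing reject state. In practice this would be filled from the
    # compiled grammar rather than left as zeros.
    T = torch.zeros(NUM_STATES, VOCAB, dtype=torch.long, device=device)

    def verify_draft(state: int, draft: torch.Tensor) -> int:
        """Walk the DFA over speculated tokens entirely on-device; return
        the length of the grammatically valid prefix of the draft."""
        s = torch.tensor(state, device=draft.device)
        states = []
        for tok in draft:
            s = T[s, tok]
            states.append(s)
        alive = (torch.stack(states) != DEAD).long()
        return int(alive.cumprod(0).sum().item())  # one host sync at the end

The table alone is states x vocab entries, which is where the "how big is this automaton" worry starts to bite.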
The AI companies believe these kinds of grammar mistakes will be solved by improving the models. Building out tooling for grammar-constrained inference like this is to suggest, on some level, that GPT-N+1 won't magically solve the problem.
The deeper level is that it's not just simple grammar constraints. Constraining to JSON is a nice party trick, but it opens the door to further ideas. How about constraining to a programming language's grammar? Those are well defined; you just swap the JSON grammar file for the Java grammar file, job done.
We can go further: why not use a language server to constrain not only the grammar but also the content? Which variables and functions are in scope is known, so constraining a variable reference or function call to one of their names can be done with the same technique as grammar constraints. ("Monitor-guided decoding", figured out back in 2023.)
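A hedged sketch of that idea (the tokenizer interface and the in-scope list are assumptions for illustration, not the paper's actual API; real monitor-guided decoding also tracks multi-token identifiers across decoding steps):

    import torch

    def identifier_mask(tokenizer, in_scope: list[str], vocab_size: int) -> torch.Tensor:
        """At a position where an identifier must appear, allow only tokens
        that could begin one of the known in-scope names."""
        mask = torch.zeros(vocab_size, dtype=torch.bool)
        for tok_id in range(vocab_size):
            piece = tokenizer.decode([tok_id])
            if piece and any(name.startswith(piece) for name in in_scope):
                mask[tok_id] = True
        return mask

    # e.g. in_scope = ["user_id", "update_user", "urllib"]
    # logits[~identifier_mask(tokenizer, in_scope, logits.shape[-1])] = float("-inf")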
Entire classes of hallucination problems can be eliminated this way. The marketing writes itself: "Our AI is literally incapable of making the errors humans make!"
What many AI developers, firms, and especially their leaders find grating about this is the implication: that AI is fallible and has to be constrained.
Another such inconvenience is that while these techniques fix grammar, they highlight semantic problems: the code is syntactically correct and compiles, it just does the wrong thing.
https://fireworks.ai/docs/structured-responses/structured-ou...
Why wouldn't we apply the mask immediately for the first sampling? Is this somehow an optimization? Is masking expensive?
> is masking expensive?
It's not expensive per se; it's a single element-wise multiplication of the output vector.
The real "expense" is that you need to prepare masks for every element of your grammar as they are expensive to recompute as needed; LLM tokens do not cleanly map onto elements of your grammar. (Consider JSON: LLM tokens often combine various special characters such as curly braces, colons, and quotes.)
This isn't that hard to compute, it's just more work to implement.
Other libraries work by essentially pre-computing all the masks for all possible generations, but of course you're restricted to working with simple grammars in this case (like a subset of regular expressions)
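A sketch of that precomputation, assuming the grammar is a regex compiled to a DFA with a made-up dfa.step(state, byte) interface; walking each vocabulary entry byte by byte is exactly what handles tokens that glue several grammar elements together:

    def precompute_masks(dfa, vocab: list[bytes]) -> dict[int, list[bool]]:
        """For every DFA state, mark which whole tokens keep the automaton
        alive. dfa.step returns the next state, or None on reject."""
        masks = {}
        for state in dfa.states:
            allowed = []
            for token in vocab:
                s = state
                for b in token:          # walk the token byte by byte
                    s = dfa.step(s, b)
                    if s is None:
                        break
                allowed.append(s is not None)
            masks[state] = allowed
        return masks

The cost is states x vocab x token length, which is why this precompute-everything approach stays practical only for simple grammars.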
However, why not use a grammar that has no invalid sentences, and from there convert to whatever grammar you want?
With a second pass you basically "condition" it on the text right above, hoping for better semantic understanding.
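Roughly, as a sketch (generate and generate_constrained are hypothetical helpers, not a real API):

    draft = generate(model, prompt)        # pass 1: the permissive grammar
    result = generate_constrained(         # pass 2: the target grammar,
        model,                             # conditioned on the draft above
        prompt + "\n\nDraft to convert:\n" + draft,
        grammar="target-grammar.ebnf",
    )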
One statement that surprised me was the author's belief that "models over time will just be able to output JSON perfectly without the need for constraining over time."
I'm not sure how this conclusion was reached. "Perfectly" is a bar that probabilistic sampling cannot meet.
User friendly library that connects to lots of OSS model serving backends: https://github.com/guidance-ai/guidance/
Core Rust library written for high performance mask computation (written mostly by my collaborator @mmoskal): http://github.com/guidance-ai/llguidance
TL;DR: instead of just getting a token and checking whether the parser would accept it, you can zero out the probabilities of all invalid tokens, and do the computation for this in parallel at effectively zero cost:
> Here, compute_mask() can run on the CPU during the time it would be normally just waiting for the GPU to finish. The line prob[~mask] = 0.0 would normally be fused into the softmax kernel in the last stage of the LLM, with negligible overhead. Therefore, as long as the compute_mask() function completes faster than the LLM forward pass and parser.consume() is negligible (typically follows from compute_mask() speed), the constrained generation will be as fast as the unconstrained one.
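Spelled out as a sketch (forward_async and is_done are schematic stand-ins; compute_mask and consume are the names from the quoted pseudocode, not necessarily llguidance's exact Python surface):

    import torch

    def constrained_generate(model, parser, tokens: list[int], max_new_tokens: int):
        for _ in range(max_new_tokens):
            future = model.forward_async(tokens)   # GPU starts the forward pass
            mask = parser.compute_mask()           # CPU builds the mask meanwhile
            logits = future.wait()                 # join before sampling
            probs = torch.softmax(logits[-1], dim=-1)
            probs[~mask] = 0.0                     # zero out grammar-invalid tokens
            next_tok = int(torch.multinomial(probs, 1))
            parser.consume(next_tok)               # advance the parser state
            tokens.append(next_tok)
            if parser.is_done():                   # grammar reached an accept state
                break
        return tokens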
I'm curious: has there been any research or conversation about pushing masking even earlier in the pipeline? In theory, a fair amount of compute goes into computing the probabilities of tokens that will end up masked away anyway.
In general I find that matching the most natural format for a document outperforms waiting for the big model trainers to convince the model that the format you want is a valid structure, so anything that lets me interweave structured and unstructured generation is very interesting to me right now.
The Gemini API has a canonical implementation of structured outputs where you can instead pass the JSON schema as a separate parameter to control the grammar more closely. However, this setting reorders the schema's fields alphabetically beforehand, which is undesirable: the order of fields in a JSON schema is often deliberate, precisely to steer generation.
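A concrete case where that hurts: a schema that deliberately puts a rationale field before the answer, so the answer tokens are generated conditioned on the reasoning. Alphabetical reordering would emit "answer" first and throw that away.

    # Deliberate field order: "reasoning" must be emitted before "answer".
    schema = {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "answer": {"type": "string"},
        },
        "required": ["reasoning", "answer"],
    }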