LLM Structured Outputs Handbook

https://nanonets.com/cookbooks/structured-llm-outputs

377•vitaelabitur•3w ago

Comments

tehnub•3w ago

This is a nice guide. I especially like the masked decoding diagrams on this page https://nanonets.com/cookbooks/structured-llm-outputs/basic-....

edit: Somehow that link doesn't work... It's the diagram on the "constrained method" page

prats226•3w ago

One of the authors here. Will check out the diagram link.

Every commercial model provider is adding structured outputs, so we will keep updating the guide.

HanClinto•3w ago

This is a seriously beautiful guide. I really appreciate you putting this together! I especially love the tab-through animations on the various pages, and this is one of the best explanations that I've seen. I generally feel I understand grammar-constrained generation pretty well (I've merged a handful of contributions to the llama.cpp grammar implementation), and yet I still learned some insights from your illustrations -- thank you!

I'm also really glad that you're helping more people understand this feature, how it works, and how to use it effectively. I strongly believe that structured outputs are one of the most underrated features in LLM engines, and people should be using this feature more.

Constrained non-determinism means that we can reliably use LLMs as part of a larger pipeline or process (such as an agent with tool-calling) and we won't have failures due to syntax errors or erroneous "Sure! Here's your output formatted as JSON with no other text or preamble" messages thrown in.

Your LLM output might not be correct. But grammars ensure that your LLM output is at least _syntactically_ correct. It's not everything, but it's not nothing.

And especially if we want to get away from cloud deployments and run effective local models, grammars are an incredibly valuable piece of this. For practical examples, I often think of Jart's example in her simple LLM-based spam-filter running on a Raspberry Pi [0]:

> llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
>   --grammar 'root ::= "yes" | "no"' --temp 0 -c 0 \
>   --no-display-prompt --log-disable -p "<|user|>
> Can you say for certain that the following email is spam? ...

Even though it's a super-tiny piece of hardware, by including a grammar that constrains the output to only ever be "yes" or "no" (it's impossible for the system to produce a different result), then she can use a super-small model on super-limited hardware, and it is still useful. It might not correctly identify spam, but it's never going to break for syntactic reasons, which gives a great boost to the usefulness of small, local models.

* [0]: https://justine.lol/matmul/

fragmede•3w ago

What does it do when the model wants to return something else, and what's better/worse about doing it in llamafile vs whatever wrapper that's calling it? How do I set retries? What if I want JSON and a range instead?

ekianjo•3w ago

There are no retries. The grammar is enforced inside llama.cpp itself, which only accepts output tokens that the grammar allows.

IanCal•3w ago

> What does it do when the model wants to return something else,

You can build that into your structure, same as you would for allowing error values to be returned from a system.
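A minimal sketch of that idea, assuming the yes/no grammar from the spam-filter example is extended with a third alternative. The "unsure" option, the Verdict enum, and the handle function are all hypothetical names made up for illustration:

```python
from enum import Enum

# Assumed grammar: the original yes/no rule plus an explicit escape hatch.
GRAMMAR = 'root ::= "yes" | "no" | "unsure"'


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    UNSURE = "unsure"


def handle(raw: str) -> str:
    """Map the constrained model output to an action in the calling code."""
    verdict = Verdict(raw.strip())
    if verdict is Verdict.UNSURE:
        # The model used the escape hatch, so route to a fallback path.
        return "needs_review"
    return "spam" if verdict is Verdict.YES else "not_spam"
```

The output is still guaranteed to parse, but the model now has a legal way to decline instead of being forced into "yes" or "no".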

TomatoCo•3w ago

You can't do it as part of whatever's calling it because this changes the sampler. The grammar constrains what tokens the sampler is allowed to consider, only passing tokens that are valid by the grammar.
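As a rough illustration of what "changing the sampler" means, here is a toy sketch. The vocabulary, the logits, and every function name are made up for the example; a real engine such as llama.cpp works over the model's actual token vocabulary and a compiled grammar rather than a list of allowed strings.

```python
import math
import random

# Toy setup (all assumptions): a tiny "vocabulary" and the only strings
# that the grammar root ::= "yes" | "no" accepts.
VOCAB = ["y", "es", "n", "o", "Sure", "!", "<eos>"]
ALLOWED = ["yes", "no"]


def valid_next_tokens(prefix: str) -> list[int]:
    """Indices of tokens that keep the output a prefix of an allowed string."""
    valid = []
    for i, tok in enumerate(VOCAB):
        if tok == "<eos>":
            if prefix in ALLOWED:  # may only stop on a complete match
                valid.append(i)
        elif any(s.startswith(prefix + tok) for s in ALLOWED):
            valid.append(i)
    return valid


def constrained_sample(logits: list[float], prefix: str, rng: random.Random) -> int:
    """Drop invalid tokens, then sample from the renormalized remainder."""
    valid = valid_next_tokens(prefix)
    weights = [math.exp(logits[i]) for i in valid]
    return rng.choices(valid, weights=weights, k=1)[0]


def generate(logits_fn, rng: random.Random) -> str:
    """Run the constrained loop until the end-of-sequence token is chosen."""
    out = ""
    while True:
        tok = VOCAB[constrained_sample(logits_fn(out), out, rng)]
        if tok == "<eos>":
            return out
        out += tok
```

Even if logits_fn puts almost all of its mass on "Sure", that token is never in the candidate set, so the result can only be "yes" or "no". That is also why a wrapper outside the engine cannot do the same thing: it never sees the per-token distribution.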

robot-wrangler•3w ago

> strongly believe that structured outputs are one of the most underrated features in LLM engines

Structured output is really the whole foundation of lots of our hopes and dreams. The JSONSchemaBench paper is fairly preoccupied with performance, but where it talks about quality/compliance, the "LM only" scores in the tables are pretty bad. This post highlights the on-going difficulty and confusion around doing a simple, necessary, and very routine task well.

Massaging small inputs into structured formats isn't really the point. It's about all the nontrivial cases central to MCP, tool-use, local or custom APIs. My favorite example of this is every tool-use tutorial that's pretending that "ping" accepts 2 arguments, but, it's actually more like 20 arguments with subtle gotchas. Do the tool-use demos that correctly work with 2 arguments actually work with 20? How many more retries might that take, and what does this change about the hardware and models we need for "basic" stuff?

If you had a JSON schema correctly and completely describing legal input for, say, ffmpeg, then its size and complexity would approach that of the Kubernetes schemas (where JSONSchemaBench compliance is only at .56). Can you maybe yolo generate a correct ffmpeg command without consulting any schema with SOTA models? Of course! But that works well because ffmpeg is a well-documented tool with decades of examples floating around in the wild. What's the arg-count and type-complexity for that one important function/class in your in-house code base? For a less well-known use case or tool, if you want hallucination-free and correct output, then you need structured output that works, because the alternative is rolling your own model trained on your stuff.

shmolyneaux•3w ago

Are there output formats that are more reliable (better adherence to the schema, easier to get parse-able output) or cheaper (fewer tokens) than JSON? YAML has its own problems and TOML isn't widely adopted, but they both seem like they would be easier to generate.

What have folks tried?

marquesine•3w ago

Yes, that's the purpose of TOON.

https://github.com/toon-format/toon

prats226•3w ago

Nice, it would be a good idea to develop a CFG for this as well, so it can be embedded into all these constrained decoding libraries.

koakuma-chan•3w ago

Is there evidence that LLMs adhere to this format better than to JSON? I doubt that.

iLoveOncall•3w ago

It is 100% guaranteed that they DON'T. Toon is 3 months old, it's not used by anyone, and it's therefore not in the training set of any model.

TheTaytay•3w ago

Their benchmarks compare it against other formats as input, not as output.

koakuma-chan•3w ago

Now it makes sense.

tlarkworthy•3w ago

I use regex to force an XML schema and then use a normal XML parser to decode.

XML is better for code, and for code parts in particular I enforce a <![CDATA[ section so the LLM is pretty free to do anything without escaping.

OpenAI API lets you do regex structured output and it's much better than JSON for code.

psadri•3w ago

Could you share some samples / pointers on how you do this?

tlarkworthy•3w ago

Yeah, this upsert_cell tool does it

https://observablehq.com/@tomlarkworthy/forking-agent#upsert...

format: { type: "grammar", syntax: "regex", definition: cellsRegex },

Where cellsRegex is

cellsRegex = {
  const CELL_OPEN = String.raw`<cell>\s`;
  const INPUTS_BLOCK = String.raw`<inputs>.*<\/inputs>\s*`;
  const CODE_BLOCK = String.raw`<code><!\[CDATA\[[\s\S]*\]\]>\s*<\/code>\s*`;
  const CELL_CLOSE = String.raw`<\/cell>`;
  return "^(" + CELL_OPEN + INPUTS_BLOCK + CODE_BLOCK + CELL_CLOSE + ")*$";
}

And the extraction logic is here: https://observablehq.com/@tomlarkworthy/robocoop-2#process

function process(content) {
  const doc = domParser.parseFromString(
    "<response>" + content + "</response>",
    "text/xml"
  );
  const cells = [...doc.querySelectorAll("cell")];
  return cells.map((cell) => {
    const inputsContent = cell.querySelector("inputs")?.textContent || "";
    return {
      inputs:
        inputsContent.length > 0
          ? inputsContent.split(",").map((s) => s.trim())
          : [],
      code: (cell.querySelector("code")?.textContent || "").trim()
    };
  });
}

BTW, that agent is under development and not actually that good at programming. Its parent, https://observablehq.com/@tomlarkworthy/robocoop-2, is actually very good at notebook programming.
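For readers outside Observable, here is a loose Python port of the same two steps: check the reply against the cell pattern, then pull the fields out with a normal XML parser. This is a sketch, not the notebook's code. The names CELLS_RE and process mirror the snippets above, but the pattern is slightly relaxed (optional whitespace, non-greedy groups) and that relaxation is my assumption.

```python
import re
import xml.etree.ElementTree as ET

# One <cell> with an <inputs> list and a CDATA-wrapped <code> body.
CELL = (
    r"<cell>\s*"
    r"<inputs>.*?</inputs>\s*"
    r"<code><!\[CDATA\[[\s\S]*?\]\]>\s*</code>\s*"
    r"</cell>\s*"
)
# The whole reply must be zero or more cells and nothing else.
CELLS_RE = re.compile(r"^(?:" + CELL + r")*$")


def process(content: str) -> list[dict]:
    """Wrap the cells in a root element and extract inputs and code."""
    root = ET.fromstring("<response>" + content + "</response>")
    cells = []
    for cell in root.iter("cell"):
        inputs = (cell.findtext("inputs") or "").strip()
        code = (cell.findtext("code") or "").strip()
        cells.append({
            "inputs": [s.strip() for s in inputs.split(",")] if inputs else [],
            "code": code,
        })
    return cells
```

Because the code sits inside CDATA, characters such as < and & in the generated code need no escaping and still survive the XML parse.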

greiskul•3w ago

Just brainstorming. Human beings have trouble writing json, cause it is too annoying. Too strict. In my experience, for humans writing typescript is a lot better than writing json directly, even when the file is just a json object. It allows comments, it allows things like trailing commas which are better for readability.

So maybe an interesting thing to have the LLM generate is, instead of the final file, a program that creates the final file? Now there is the problem of security of course: the program the LLM generates would need to be sandboxed properly and time constrained to prevent DOS attacks or explosive output sizes, not to mention the CPU usage of the final result. But quality-wise, would it be better?

orbital-decay•3w ago

You should do your own evals specific to your case. In my evals XML outperforms JSON on every model for out of distribution tasks (i.e. not for JSON that was in the data).

kaaloo•3w ago

We're working on an agentic content transformation pipeline based on markdown with YAML metadata in the front matter. I'm a bit worried about the lack of tooling with respect to JSON payloads but then again it's not that hard to parse and then convert to JSON to validate against a schema.

max2•3w ago

Generating code that, when run, generates JSON works well if you design builder functions thoughtfully. Takes fewer tokens too.

roywiggins•3w ago

> We use a lenient parser like ast.literal_eval instead of the standard json.loads(). It will handle outputs that deviate from strict JSON format. (single quotes, trailing commas, etc.)

A nitpick: that's probably a good idea and I've used it before, but that's not really a lenient json parser, it's a Python literal parser and they happen to be close enough that it's useful.
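A small sketch of where the two overlap and where they don't. The parse_lenient helper is a hypothetical name, not something taken from the guide:

```python
import ast
import json


def parse_lenient(text: str):
    """Try strict JSON first, then fall back to Python literal syntax."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Accepts single quotes and trailing commas, but only because
        # those happen to be valid Python literal syntax.
        return ast.literal_eval(text)
```

With this, "{'a': 1,}" parses to a dict even though it is not valid JSON. On the other hand "{'ok': true}" raises ValueError, because true is a JSON keyword but not a Python literal, and "(1, 2)" is accepted as a tuple even though JSON has no such thing.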

earth2mars•3w ago

BAML

prats226•3w ago

https://nanonets.com/cookbooks/structured-llm-outputs/uncons...

dfajgljsldkjag•3w ago

I agree that building agents is basically impossible if you cannot trust the model to output valid json every time. This seems like a decent collection of the current techniques we have to force deterministic structure for production systems.

FlyingLawnmower•3w ago

Very nicely written guide!

If the authors or readers are interested in some of the more technical details of how we optimized guidance & llguidance, we wrote up a little paper about it here: https://guidance-ai.github.io/llguidance/llg-go-brrr

vitaelabitur•3w ago

One of the authors here. I've read the paper. Brilliant work, especially the slicing implementation for denser token masks.

fabiensanglard•3w ago

What would be the point of outputting unconstrained json if the output is consumed by a human?

mcyc•3w ago

This is a fantastic guide! I did a lot of work on structured generation for my PhD. Here are a few other pointers for people who might be interested:

Some libraries:

- Outlines, a nice library for structured generation

  - https://github.com/dottxt-ai/outlines

- Guidance (already covered by FlyingLawnmower in this thread), another nice library

  - https://github.com/guidance-ai/guidance

- XGrammar, a less-featureful but really well optimized constrained generation library

  - https://github.com/mlc-ai/xgrammar

  - This one has a lot of cool technical aspects that make it an interesting project

Some papers:

- Efficient Guided Generation for Large Language Models

  - By the outlines authors, probably the first real LLM constrained generation paper

  - https://arxiv.org/abs/2307.09702

- Automata-based constraints for language model decoding

  - A much more technical paper about constrained generation and implementation

  - https://arxiv.org/abs/2407.08103

- Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation

  - A bit of self-promotion. We show where constrained generation can go wrong and discuss some techniques for the practitioner

  - https://openreview.net/pdf?id=DFybOGeGDS

Some blog posts:

- Fast, High-Fidelity LLM Decoding with Regex Constraints

  - Discusses adhering to the canonical tokenization (i.e., not just the constraint, but also what would be produced by the tokenizer)

  - https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html

- Coalescence: making LLM inference 5x faster

  - Also from the outlines team

  - This is about skipping inference during constrained generation if you know there is only one valid token (common in the canonical tokenization setting)

  - https://blog.dottxt.ai/coalescence.html
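To make the "skip inference when only one token is valid" idea from that last post concrete, here is a toy sketch. It uses single characters as tokens and a fixed pair of target strings, both of which are simplifying assumptions; generate_with_skips, valid_next, and pick are illustrative names, not any library's API.

```python
from typing import Callable

# Assumed constraint: the output must be exactly one of these strings.
TARGETS = ['{"spam": true}', '{"spam": false}']


def valid_next(prefix: str) -> list[str]:
    """Characters that keep the output a prefix of some target."""
    return sorted({t[len(prefix)] for t in TARGETS
                   if t.startswith(prefix) and len(t) > len(prefix)})


def generate_with_skips(
    valid_next: Callable[[str], list[str]],
    pick: Callable[[str, list[str]], str],
    max_steps: int = 64,
) -> tuple[str, int]:
    """Return (text, model_calls). `pick` stands in for a model forward
    pass plus constrained sampling; it runs only when there is a choice."""
    out, calls = "", 0
    for _ in range(max_steps):
        options = valid_next(out)
        if not options:        # nothing valid remains: output is complete
            break
        if len(options) == 1:  # forced token: append without the model
            out += options[0]
            continue
        calls += 1
        out += pick(out, options)
    return out, calls
```

In this toy the only real decision is the character after '{"spam": ', so the stand-in model is consulted once and every other character is appended for free.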

reactordev•3w ago

What a gold mine!

Automata-based constraints is fun.

anonymoushn•3w ago

Hello, the part about canonical filtering in https://openreview.net/pdf?id=DFybOGeGDS doesn't seem to try to account for pretokenization. For example, if you receive " 天天中彩票APP" in o200k, it means there has to be a lowercase letter within the span of letters, and while tokens like (4 spaces) may be pairwise compatible with tokens like "123" according to the BPE merge rules, the pretokenizer would split the span of spaces to give (3 spaces), " ", "123" instead. Are you aware of any work that does actual canonical generation for models with this kind of pretokenization regex?

iLoveOncall•3w ago

> Here are a few other pointers

Proceeds to list all the libraries already listed in the guide.

crashabr•3w ago

I've never fully understood where Outlines fit in the stack. Is it a way to create a structured output API similar to the ones big providers have? Have you looked at something like BAML?

Imanari•3w ago

I like structured outputs as much as the next guy but be careful not to try to structure natural language.

maxdo•3w ago

Huge fan of BAML, nice coverage

kylecazar•3w ago

This information is really presented well. I subscribed to your newsletter. Thanks!

bandrami•3w ago

These are cool tricks but this seems like an impedance mismatch: why would you use an LLM (a probabilistic source of plausible text) in a situation where you want a deterministic source of text where plausibility is not enough?

orbital-decay•3w ago

You... don't. That's exactly what structured outputs are for! You're offloading any formally defined generation to a tool that better serves the case, leaving the ambiguous part of the task to the model.

Code is an example of a mixed case. Getting any mechanistically parsable output from a model is another. Sure, you can format it after the generation, but you still need the generation to be parsable for that. In many cases, using the required format right away will also provide the context for better replies.

xboxnolifes•3w ago

Because of their ability to handle unstructured input well.

hansvm•3w ago

This is good. It covers the two easiest dominant methods people use. It even touches on my main complaint for the one they seem to recommend.

That said:

- Constrained generation yields a different distribution from what a raw LLM would provide. This can be pathologically bad. My go-to example is LLMs having a preference for including ellipses in long, structured objects. Constrained generation forces closing quotes or whatever it takes to recover from that error according to a schema, nevertheless yielding an invalid result. Resampling tends to repeat till the LLM fully generates the data in question, always yielding a valid result which also adheres to the schema. It can get much worse than that.

- The unconstrained "method" has a few possible implementations. Increasing context length by complaining about schema errors is almost always worse from an end quality perspective than just retrying till the schema passes. Effective context windows are precious, and current models bias heavily toward earlier data which has been fed into them. In a low-error regime you might get away with a "try it again" response in a single chat, but in a high-error regime you'll get better results at a lower cost by literally re-sending the same prompt till the model doesn't cause errors.
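The "re-send the same prompt" variant from the second point can be sketched like this. The generate callable is a stand-in for whatever client is in use, and resample_until_valid is a hypothetical helper name:

```python
import json
from typing import Callable, Optional


def resample_until_valid(
    generate: Callable[[str], str],
    prompt: str,
    validate: Callable[[object], bool],
    max_attempts: int = 5,
) -> Optional[object]:
    """Re-send the *same* prompt until the output parses and validates.

    No error messages are appended, so the context never grows."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # syntax error: throw the attempt away and resample
        if validate(parsed):
            return parsed
    return None  # caller decides what to do after exhausting attempts
```

The contrast is with a feedback loop that appends each failed output plus an error message to the conversation, which makes every subsequent attempt longer and more expensive.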

vitaelabitur•3w ago

> Increasing context length by complaining about schema errors is almost always worse from an end quality perspective than just retrying till the schema passes.

Another way to do this is to use a hybrid approach. You perform unconstrained generation first, and then constrained generation on the failures.

hansvm•3w ago

There's no difference in the output distribution between always doing constrained generation and only doing it on the failures though. What's the advantage?

vitaelabitur•3w ago

There's no advantage wrt output quality, but it can be more economical in some high-error regimes, with fewer LLM calls used in resampling (max 2 for most errors).

hansvm•3w ago

My point is that if you're capable of doing constrained generation and want to try once and then constrain on failure, since that has the same output distribution as doing constrained generation in the first place, you'd be better off just doing constrained generation always (max of 1 LLM call for the class of errors fixed by this).

There's only a different distribution with 2+ initial attempts before falling back to constrained, at least if I haven't screwed up any math.

bjt12345•2w ago

Regarding your first point, this makes me wonder if diffusion models will be the future of constrained decoding.

hansvm•2w ago

Perhaps. Would you mind elaborating on what you're envisioning?

In both cases (auto-regressive vs diffusive), you still have some process that's being followed, and the exact steps in the process are important to the result. If you constrain at each step then you get the equivalent of something like projected gradient descent (as an analogy) and aren't guaranteed the same solution. If you constrain as a post-processing phase then (a) diffusion wasn't required for the initial generation, and (b) that's still unlikely to converge to the same distribution (for similar reasons -- using my example of ellipsis errors, if you corrected that particular mistake in post then the closest valid messages to the initial generation are likely too short and thus still incorrect).

libraryofbabel•3w ago

Question for the well-informed people reading this thread: do SoTA models like Opus, Gemini and friends actually need output schema enforcement still, or has all the RLVR training they do on generating code and JSON etc. made schema errors vanishingly unlikely? Because as a user of those models, they almost never make syntax mistakes in generating JSON and code; perhaps they still do output schema enforcement for "internal" things like tool call schemas though? I would just be surprised if it was actually catching that many errors. Maybe once in a while; LLMs are probabilistic after all.

(I get why you need structured generation for smaller LLMs, that makes sense.)

kleton•3w ago

Yes. The most common failure mode for SOTA models is to put ```json\n first, but they also just fail often enough that it's worth calling the API with a JSON response schema.

XenophileJKO•3w ago

1000% I was just doing some spot checking of GPT-5.2 for evaluating model migration and the tool I used didn't have the setup to use schema constrained inference.

The model is like: "Here is what I came up with... ```{json}``` and this is why I am proud of it!"
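When schema-constrained inference isn't available, a common workaround for that chatty wrapper is to dig the fenced block out before parsing. A minimal sketch, with extract_json as a hypothetical helper; it only handles the first fenced block and is no substitute for real enforcement:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Parse the first fenced block if present, else the whole reply."""
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())
```

This still raises json.JSONDecodeError when the payload itself is malformed, so it pairs naturally with a retry or a schema check afterwards.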

runeblaze•3w ago

Schemas can get pretty complex (and LLMs might not be the best at counting). Also schemas are sometimes the first way to guard against the stochasticity of LLMs.

With that said, the model is pretty good at it.

ineedasername•3w ago

This is going to be task-dependent, as well as limited by your (the implementer's) ability and comfort with structuring the task in solid multi-shot prompts that cover a large distribution of expected inputs, which will also help increase the ability for the model to successfully handle less common or edge case inputs -- the ones that would most typically require human-level reasoning. It can be useful to supplement this with a "tool" use for RAG lookup against a more extensive store of examples, or any time the full reference material isn't practical to dump into context. This requires thoughtful chunking.

It also requires testing. Don't think of it as a magic machine that should be able to do anything, think of it like a new employee smart enough and with enough background knowledge to do the task, if given proper job documentation. Test whether few-shot or many shot prompting works better: there's growing information about use cases where one or the other confers an advantage but so much of this is task dependent.

Consider your tolerance for errors and plan some escalation method: Hallucinations occur in part because models "have to" give an answer. Make sure that any critical cases where an error would be problematic have some way for the model to bail out with "i don't know" for human review. The first layer of escalation doesn't even have to be a human, it could be a separate model, eg Opus instead of Sonnet, or the same model but with a different setup prompt explicitly designed for handling certain cases without cluttering up context of the first one. Splitting things in this way, if there's a logical break point, is also a great way to save on token cost: If you can send only 10k of tokens in a system prompt instead of 50k and just choose which of 5 10k prompts to use for different cases then you save 80% of upstream token $$.
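One way to picture that escalation path, as a sketch only. Each tier is a hypothetical callable standing in for a model call (a cheap model, a stronger model, the same model with a specialised prompt), and the "status" field is an assumed convention for the bail-out signal:

```python
from typing import Callable, Optional

# A tier takes the input text and returns a dict such as
# {"status": "ok", "answer": ...} or {"status": "unknown"}.
Tier = Callable[[str], dict]


def answer_with_escalation(text: str, tiers: list[Tier]) -> Optional[dict]:
    """Try each tier in order and stop at the first confident answer.

    Returns None if every tier bails out, which the caller can treat
    as "queue for human review"."""
    for tier in tiers:
        result = tier(text)
        if result.get("status") == "ok":
            return result
    return None
```

Ordering the tiers from cheapest to most expensive means the costly model only sees the cases the cheaper one explicitly declined.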

Consider running the model deterministically: 0 temp, same seed. It makes any errors you encounter easier to trace and debug.

Something to consider with respect to cost though: Many tasks that a SoTA model can do with very little or no scaffolding can be done with these cheaper models and may not take much more scaffolding. If a SoTA model is giving reliable responses with zero-shot prompting, there's a decent chance you can save a ton of money with a flash model if you provide it one- or few-shot prompts. Open weight models even more so.

My anecdotal experience is that open models like Google's gemma and OpenAI's gpt-oss have behaviors more similar to their paid counterparts than other open models, making them reasonable candidates to try if you're getting good results from the paid models but they're perhaps overkill for the task.

msp26•3w ago

Incredible guide, wow. Will definitely share with people. I wish I had something like this a year ago.

iLoveOncall•3w ago

Stupid question but isn't this useless for 99% of users? By that I mean that either your API provider supports Structured Outputs (OpenAI and Google) or it doesn't and you're SOL.

Sure the guide presents some alternatives but they're incomparably useless VS real enforced structured output.

I get that some people will run their own models or whatever and will be able to use some of the other techniques, but that's the remaining 1%.

iamflimflam1•3w ago

Well, you still need to decide if structured output is the right choice.

As they point out - this might impact results where deep reasoning is required.

So you might be better off taking the unconstrained approach with feedback.

iLoveOncall•3w ago

The only "solution" with the unconstrained approach is to ask the LLM to regenerate the JSON. This is definitely more expensive than whatever downside from requesting structured outputs from the API.

ESPECIALLY with situations where deep reasoning is required, since those are likely to correlate with longer JSON outputs and therefore more failure points.

iamflimflam1•3w ago

Definitely. A lot of what is missing in many discussions is the absolutely essential need to have evals.

The only way to “know” what is the best (or better) approach is to have a significant number of test cases that you can measure performance against.

At the moment, for a lot of people, state of the art is “let’s try a different prompt and see if the answer on my one example is better”

meta-level•3w ago

The first thing I noticed is that the article uses https://xkcd.com/2347/ without a reference. Is it famous enough to be sure everybody knows the origin?

davedx•3w ago

I've built pipelines with and without lab-provided structured outputs. One thing to be aware of is that enforcing structured outputs has a performance penalty.

That might not matter to you, but it can be 2-3x slower sometimes.

darkamaul•3w ago

Curious what tech stack is behind this docs/cookbook page. Doesn't look like standard MkDocs/GitBook, but maybe I'm wrong.

Would love to know.

vitaelabitur•3w ago

We used Docusaurus.

farhankhan3•3w ago

BAML (https://boundaryml.com/) is really great for handling structured outputs. It's what we use at my company.

ilamparithi•3w ago

For my use case (https://www.grokvocab.com/), I get proper JSON output without much effort. I am using Langchain4J (https://github.com/langchain4j/langchain4j) which automatically maps the output JSON to my POJO. I just prompt the model to return the output as JSON.

bjt12345•3w ago

To use constrained decoding like this, do you really need to use an open-weight model, or can this be done using OpenAI, Claude, Gemini, etc.?
