I'm also really glad that you're helping more people understand this feature, how it works, and how to use it effectively. I strongly believe that structured outputs are one of the most underrated features in LLM engines, and people should be using this feature more.
Constrained non-determinism means that we can reliably use LLMs as part of a larger pipeline or process (such as an agent with tool-calling) and we won't have failures due to syntax errors or erroneous "Sure! Here's your output formatted as JSON with no other text or preamble" messages thrown in.
Your LLM output might not be correct. But grammars ensure that your LLM output is at least _syntactically_ correct. It's not everything, but it's not nothing.
And especially if we want to get away from cloud deployments and run effective local models, grammars are an incredibly valuable piece of this. For practical examples, I often think of Jart's example in her simple LLM-based spam-filter running on a Raspberry Pi [0]:
> llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
>   --grammar 'root ::= "yes" | "no"' --temp 0 -c 0 \
>   --no-display-prompt --log-disable -p "<|user|>
> Can you say for certain that the following email is spam? ...
Even though it's a super-tiny piece of hardware, including a grammar that constrains the output to only ever be "yes" or "no" (it's impossible for the system to produce a different result) means she can use a super-small model on super-limited hardware and still get something useful. It might not correctly identify spam, but it's never going to break for syntactic reasons, which gives a great boost to the usefulness of small, local models.
* [0]: https://justine.lol/matmul/
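To make the "impossible to produce a different result" part concrete, here is a minimal sketch of grammar-constrained greedy decoding. It is not how llamafile implements --grammar; the toy vocabulary, the scoring function, and the helper names (allowedNext, constrainedGreedy) are all made up for illustration. The idea is only that, at every step, tokens which cannot extend into "yes" or "no" are masked out before the best one is picked.

```javascript
// Toy grammar: root ::= "yes" | "no"
const ACCEPTED = ["yes", "no"];

// A token is allowed if text-so-far + token is still a prefix of an accepted string.
function allowedNext(prefix, vocab) {
  return vocab.filter(
    (tok) => tok.length > 0 && ACCEPTED.some((s) => s.startsWith(prefix + tok))
  );
}

// Greedy decoding with a mask: pick the highest-scoring *allowed* token.
// `scoreFn(prefix, token)` is a hypothetical stand-in for the model's logits.
function constrainedGreedy(vocab, scoreFn) {
  let out = "";
  while (!ACCEPTED.includes(out)) {
    const candidates = allowedNext(out, vocab);
    if (candidates.length === 0) {
      throw new Error("vocabulary cannot complete the grammar");
    }
    let best = candidates[0];
    for (const tok of candidates) {
      if (scoreFn(out, tok) > scoreFn(out, best)) best = tok;
    }
    out += best;
  }
  return out;
}

// Toy vocabulary and a "model" that would much rather start with "Sure".
const vocab = ["Sure", "!", "y", "es", "yes", "n", "o", "no", " "];
const chattyModel = (prefix, tok) =>
  tok === "Sure" ? 10 : tok === "n" ? 2 : tok === "o" ? 1 : 0;

const answer = constrainedGreedy(vocab, chattyModel);
console.log(answer);
```

The highest-scoring token overall ("Sure") never gets a chance, because it is filtered out before the comparison happens. The answer can still be wrong, but it cannot be anything other than one of the two accepted strings.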
You can build that into your structure, same as you would for allowing error values to be returned from a system.
Structured output is really the whole foundation of lots of our hopes and dreams. The JSONSchemaBench paper is fairly preoccupied with performance, but where it talks about quality/compliance, the "LM only" scores in the tables are pretty bad. This post highlights the ongoing difficulty and confusion around doing a simple, necessary, and very routine task well.
Massaging small inputs into structured formats isn't really the point. It's about all the nontrivial cases central to MCP, tool-use, local or custom APIs. My favorite example of this is every tool-use tutorial that's pretending that "ping" accepts 2 arguments, but, it's actually more like 20 arguments with subtle gotchas. Do the tool-use demos that correctly work with 2 arguments actually work with 20? How many more retries might that take, and what does this change about the hardware and models we need for "basic" stuff?
If you had a JSON schema correctly and completely describing legal input for, say, ffmpeg, then its size and complexity would approach that of the Kubernetes schemas (where JSONSchemaBench compliance is only at .56). Can you maybe yolo generate a correct ffmpeg command with SOTA models without consulting any schema? Of course! But that works well because ffmpeg is a well-documented tool with decades of examples floating around in the wild. What's the arg-count and type-complexity for that one important function/class in your in-house code base? For a less well-known use case or tool, if you want hallucination-free and correct output, then you need structured output that works, because the alternative is rolling your own model trained on your stuff.
What have folks tried?
XML is better for code, and for code parts in particular I enforce a CDATA section (<![CDATA[ ... ]]>) so the LLM is pretty free to do anything without escaping.
OpenAI API lets you do regex structured output and it's much better than JSON for code.
https://observablehq.com/@tomlarkworthy/forking-agent#upsert...
format: { type: "grammar", syntax: "regex", definition: cellsRegex },
Where cellsRegex is

    cellsRegex = {
      const CELL_OPEN = String.raw`<cell>\s`;
      const INPUTS_BLOCK = String.raw`<inputs>.*<\/inputs>\s*`;
      const CODE_BLOCK = String.raw`<code><!\[CDATA\[[\s\S]*\]\]>\s*<\/code>\s*`;
      const CELL_CLOSE = String.raw`<\/cell>`;
      return "^(" + CELL_OPEN + INPUTS_BLOCK + CODE_BLOCK + CELL_CLOSE + ")*$";
    }

And the extraction logic is here https://observablehq.com/@tomlarkworthy/robocoop-2#process
    function process(content) {
      // Wrap the response so the cells share a single XML root element.
      const doc = domParser.parseFromString(
        "<response>" + content + "</response>",
        "text/xml"
      );
      const cells = [...doc.querySelectorAll("cell")];
      return cells.map((cell) => {
        const inputsContent = cell.querySelector("inputs")?.textContent || "";
        return {
          inputs:
            inputsContent.length > 0
              ? inputsContent.split(",").map((s) => s.trim())
              : [],
          code: (cell.querySelector("code")?.textContent || "").trim()
        };
      });
    }
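A quick way to sanity-check a grammar like this is to run the same regex over a sample response outside the notebook. The sketch below rebuilds cellsRegex from the pieces as posted; the sample response and the regex-based extractCells helper are hypothetical additions of mine, used only because DOMParser is not available outside a browser.

```javascript
// Same pieces as the cellsRegex cell above.
const CELL_OPEN = String.raw`<cell>\s`;
const INPUTS_BLOCK = String.raw`<inputs>.*<\/inputs>\s*`;
const CODE_BLOCK = String.raw`<code><!\[CDATA\[[\s\S]*\]\]>\s*<\/code>\s*`;
const CELL_CLOSE = String.raw`<\/cell>`;
const cellsRegex = "^(" + CELL_OPEN + INPUTS_BLOCK + CODE_BLOCK + CELL_CLOSE + ")*$";

// Made-up sample response containing one cell. Note the unescaped < and &&.
const sample = [
  "<cell>",
  "<inputs>a, b</inputs>",
  "<code><![CDATA[if (a < b) return a && b;]]></code>",
  "</cell>",
].join("\n");

const matches = new RegExp(cellsRegex).test(sample);
const matchesWithPreamble = new RegExp(cellsRegex).test("Sure! Here you go:\n" + sample);

// Hypothetical DOM-free extraction, for illustration only.
function extractCells(content) {
  const cellPattern =
    /<cell>\s<inputs>(.*)<\/inputs>\s*<code><!\[CDATA\[([\s\S]*?)\]\]>\s*<\/code>\s*<\/cell>/g;
  return [...content.matchAll(cellPattern)].map(([, inputs, code]) => ({
    inputs: inputs.length > 0 ? inputs.split(",").map((s) => s.trim()) : [],
    code: code.trim(),
  }));
}

console.log(matches, matchesWithPreamble, extractCells(sample));
```

The sample passes and the version with a chatty preamble does not, which is the whole point of anchoring the pattern with ^ and $. One thing the check makes visible: as written, the pattern allows no whitespace between one closing cell tag and the next opening one.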
BTW that agent is under development and not actually that good at programming. Its parent https://observablehq.com/@tomlarkworthy/robocoop-2 is actually very good at notebook programming
So maybe, instead of the final file, an interesting thing to have the LLM generate is a program that creates the final file? Now there is the problem of security of course: the program the LLM generates would need to be sandboxed properly and time-constrained to prevent DoS attacks or explosive output sizes, not to mention the CPU usage of the final result. But quality-wise, would it be better?
A nitpick: that's probably a good idea and I've used it before, but that's not really a lenient JSON parser, it's a Python literal parser, and the two happen to be close enough that it's useful.
If the authors or readers are interested in some of the more technical details of how we optimized guidance & llguidance, we wrote up a little paper about it here: https://guidance-ai.github.io/llguidance/llg-go-brrr
Some libraries:
- Outlines, a nice library for structured generation
  - https://github.com/dottxt-ai/outlines
- Guidance (already covered by FlyingLawnmower in this thread), another nice library
  - https://github.com/guidance-ai/guidance
- XGrammar, a less-featureful but really well optimized constrained generation library
  - https://github.com/mlc-ai/xgrammar
  - This one has a lot of cool technical aspects that make it an interesting project

Some papers:
- Efficient Guided Generation for Large Language Models
  - By the outlines authors, probably the first real LLM constrained generation paper
  - https://arxiv.org/abs/2307.09702
- Automata-based constraints for language model decoding
  - A much more technical paper about constrained generation and implementation
  - https://arxiv.org/abs/2407.08103
- Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation
  - A bit of self-promotion. We show where constrained generation can go wrong and discuss some techniques for the practitioner
  - https://openreview.net/pdf?id=DFybOGeGDS

Some blog posts:
- Fast, High-Fidelity LLM Decoding with Regex Constraints
  - Discusses adhering to the canonical tokenization (i.e., not just the constraint, but also what would be produced by the tokenizer)
  - https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html
- Coalescence: making LLM inference 5x faster
  - Also from the outlines team
  - This is about skipping inference during constrained generation if you know there is only one valid token (common in the canonical tokenization setting)
  - https://blog.dottxt.ai/coalescence.html

Automata-based constraints is fun.
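The coalescence idea from the blog-post list above can be sketched in a few lines: when the constraint leaves exactly one legal continuation, append it without asking the model at all. The toy below works on characters rather than real tokenizer tokens, and pickFn is a hypothetical stand-in for a masked model call. The actual Outlines implementation works on a token-level automaton and is considerably more involved, so treat this as an illustration of the principle only.

```javascript
// The constraint: the output must be exactly one of these strings.
const ALLOWED_OUTPUTS = ['{"spam": true}', '{"spam": false}'];

// Characters that can legally follow `prefix` under the constraint.
function nextChars(prefix) {
  const chars = new Set();
  for (const s of ALLOWED_OUTPUTS) {
    if (s.startsWith(prefix) && s.length > prefix.length) chars.add(s[prefix.length]);
  }
  return [...chars];
}

// Constrained generation that only calls the model at real choice points.
function generate(pickFn) {
  let out = "";
  let modelCalls = 0;
  while (!ALLOWED_OUTPUTS.includes(out)) {
    const options = nextChars(out);
    if (options.length === 0) throw new Error("left the constraint: " + out);
    if (options.length === 1) {
      out += options[0]; // forced step: no inference needed
    } else {
      modelCalls += 1;
      out += pickFn(out, options); // stand-in for a masked forward pass
    }
  }
  return { out, modelCalls };
}

// Hypothetical "model" that prefers the "false" branch when it has a choice.
const result = generate((prefix, options) => (options.includes("f") ? "f" : options[0]));
console.log(result);
```

Here the whole 15-character output needs a single model call, at the one position where "t" and "f" are both legal. Everything else is dictated by the constraint.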
Proceeds to list all the libraries already listed in the guide.
Code is an example of a mixed case. Getting any mechanistically parsable output from a model is another. Sure, you can format it after the generation, but you still need the generation to be parsable for that. In many cases, using the required format right away will also provide the context for better replies.
That said:
- Constrained generation yields a different distribution from what a raw LLM would provide. This can be pathologically bad. My go-to example is LLMs having a preference for including ellipses in long, structured objects. Constrained generation forces closing quotes or whatever it takes to recover from that error according to a schema, nevertheless yielding an invalid result. Resampling tends to repeat till the LLM fully generates the data in question, always yielding a valid result which also adheres to the schema. It can get much worse than that.
- The unconstrained "method" has a few possible implementations. Increasing context length by complaining about schema errors is almost always worse from an end quality perspective than just retrying till the schema passes. Effective context windows are precious, and current models bias heavily toward earlier data which has been fed into them. In a low-error regime you might get away with a "try it again" response in a single chat, but in a high-error regime you'll get better results at a lower cost by literally re-sending the same prompt till the model doesn't cause errors.
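The "literally re-send the same prompt" strategy from the second bullet might look like the sketch below. callModel is a hypothetical synchronous stand-in for an LLM call (a real client would be async), and looksValid is whatever schema check you already have; neither comes from a specific library.

```javascript
// Retry with the *same* prompt; never append error feedback to the context.
function generateUntilValid(callModel, prompt, isValid, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = callModel(prompt); // identical prompt on every attempt
    if (isValid(text)) return { text, attempts: attempt };
  }
  return { text: null, attempts: maxAttempts };
}

// Example validity check: parses as JSON and has a numeric `count` field.
function looksValid(text) {
  try {
    return typeof JSON.parse(text).count === "number";
  } catch {
    return false;
  }
}

// Fake model that fails twice (ellipsis truncation, then a preamble) before succeeding.
const canned = [
  '{"count": 3, "items": ["a", ...',
  "Sure! Here's your JSON: {}",
  '{"count": 3}',
];
let calls = 0;
const fakeModel = () => canned[calls++ % canned.length];

const outcome = generateUntilValid(fakeModel, "List the items as JSON.", looksValid);
console.log(outcome);
```

The context the model sees is the same size on attempt three as on attempt one, which is the property the bullet argues for. The cost is paid in extra calls instead of extra context.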
Another way to do this is to use a hybrid approach. You perform unconstrained generation first, and then constrained generation on the failures.
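A rough sketch of that hybrid, under the assumption that you have two ways of calling the same model: unconstrained (free-form) and constrained (grammar-enforced). Both are hypothetical synchronous callbacks here, and freeAttempts is a made-up knob for how many free-form tries to allow before falling back.

```javascript
// Try free-form generation first; fall back to constrained decoding on failure.
function hybridGenerate(prompt, { unconstrained, constrained, isValid, freeAttempts = 1 }) {
  for (let i = 0; i < freeAttempts; i++) {
    const text = unconstrained(prompt);
    if (isValid(text)) return { text, path: "unconstrained" };
  }
  // The constrained call is assumed to be schema-valid by construction.
  return { text: constrained(prompt), path: "constrained" };
}

const parsesAsJson = (text) => {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
};

// Hypothetical callbacks: the free-form path adds a preamble, so it fails validation.
const viaFallback = hybridGenerate("Summarize as JSON.", {
  unconstrained: () => 'Here you go: {"summary": "ok"}',
  constrained: () => '{"summary": "ok"}',
  isValid: parsesAsJson,
});
console.log(viaFallback.path);
```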
There's only a different distribution with 2+ initial attempts before falling back to constrained, at least if I haven't screwed up any math.
In both cases (auto-regressive vs diffusive), you still have some process that's being followed, and the exact steps in the process are important to the result. If you constrain at each step then you get the equivalent of something like projected gradient descent (as an analogy) and aren't guaranteed the same solution. If you constrain as a post-processing phase then (a) diffusion wasn't required for the initial generation, and (b) that's still unlikely to converge to the same distribution (for similar reasons -- using my example of ellipsis errors, if you corrected that particular mistake in post then the closest valid messages to the initial generation are likely too short and thus still incorrect).
(I get why you need structured generation for smaller LLMs, that makes sense.)
The model is like: "Here is what I came up with... ```{json}``` and this is why I am proud of it!"
With that said, the model is pretty good at it.
It also requires testing. Don't think of it as a magic machine that should be able to do anything; think of it like a new employee smart enough and with enough background knowledge to do the task, if given proper job documentation. Test whether few-shot or many-shot prompting works better: there's growing information about use cases where one or the other confers an advantage, but so much of this is task-dependent.
Consider your tolerance for errors and plan some escalation method: Hallucinations occur in part because models "have to" give an answer. Make sure that any critical cases where an error would be problematic have some way for the model to bail out with "I don't know" for human review. The first layer of escalation doesn't even have to be a human. It could be a separate model, e.g. Opus instead of Sonnet, or the same model with a different setup prompt explicitly designed for handling certain cases without cluttering up the context of the first one. Splitting things in this way, if there's a logical break point, is also a great way to save on token cost: If you can send only 10k tokens in a system prompt instead of 50k, and just choose which of five 10k prompts to use for different cases, then you save 80% of upstream token $$.
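As a sketch of what an escalation chain could look like: each tier either answers or signals "I don't know" by returning null, and anything nobody answers lands in a human-review bucket. The tier names and the run callbacks are hypothetical stand-ins for real model calls. The promptSavings helper just reproduces the 10k-vs-50k arithmetic.

```javascript
// Each tier returns { answer } where answer === null means "I don't know".
function answerWithEscalation(question, tiers) {
  for (const tier of tiers) {
    const { answer } = tier.run(question);
    if (answer !== null) return { answer, handledBy: tier.name };
  }
  return { answer: null, handledBy: "human-review" };
}

// Fraction of system-prompt tokens saved by routing to one smaller prompt.
function promptSavings(fullPromptTokens, routedPromptTokens) {
  return 1 - routedPromptTokens / fullPromptTokens;
}

// Hypothetical tiers: a cheap model that bails out on refunds, then a stronger one.
const tiers = [
  {
    name: "small-model",
    run: (q) => ({ answer: q.includes("refund") ? null : "standard reply" }),
  },
  { name: "large-model", run: () => ({ answer: "escalated reply" }) },
];

console.log(answerWithEscalation("Where is my refund?", tiers));
console.log(promptSavings(50000, 10000));
```

For the bail-out to be usable, the output schema has to allow it: if the only legal outputs are answers, the model is back to "having to" give one.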
Consider running the model deterministically: 0 temp, same seed. It makes any errors you encounter easier to trace and debug.
Something to consider with respect to cost though: Many tasks that a SoTA model can do with very little or no scaffolding can be done with these cheaper models and may not take much more scaffolding. If a SoTA model is giving reliable responses with zero-shot prompting, there's a decent chance you can save a ton of money with a flash model if you provide it one-shot or few-shot prompts. Open weight models even more so.
My anecdotal experience is that open models like Google's Gemma and OpenAI's gpt-oss have behaviors more similar to their paid counterparts than other open models, making them reasonable candidates to try if you're getting good results from the paid models but they're perhaps overkill for the task.
Sure, the guide presents some alternatives, but they're incomparably useless vs. real enforced structured output.
I get that some people will run their own models or whatever and will be able to use some of the other techniques, but that's the remaining 1%.
As they point out - this might impact results where deep reasoning is required.
So you might be better off taking the unconstrained approach with feedback.
ESPECIALLY with situations where deep reasoning is required, since those are likely to correlate with longer JSON outputs and therefore more failure points.
The only way to “know” what is the best (or better) approach is to have a significant number of test cases that you can measure performance against.
At the moment, for a lot of people, state of the art is “let’s try a different prompt and see if the answer on my one example is better”
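Even a tiny harness beats eyeballing one example. The sketch below scores a strategy on two things separately, whether the output parses and whether the answer is right, since constrained decoding fixes the first but not the second. The test cases and both strategies are made up, and a strategy here is just a hypothetical synchronous function from input text to raw model output.

```javascript
// Score a strategy against labelled cases: parse validity and answer accuracy.
function evaluate(strategy, testCases) {
  let valid = 0;
  let correct = 0;
  for (const { input, expected } of testCases) {
    let parsed;
    try {
      parsed = JSON.parse(strategy(input));
    } catch {
      continue; // unparseable output counts as both invalid and incorrect
    }
    valid += 1;
    if (parsed !== null && parsed.label === expected) correct += 1;
  }
  return { validRate: valid / testCases.length, accuracy: correct / testCases.length };
}

// Made-up cases and two made-up strategies, just to show the comparison.
const testCases = [
  { input: "win a free prize now", expected: "spam" },
  { input: "meeting moved to 3pm", expected: "ham" },
  { input: "free crypto giveaway", expected: "spam" },
  { input: "lunch tomorrow?", expected: "ham" },
];
const strategyA = (text) => JSON.stringify({ label: text.includes("free") ? "spam" : "ham" });
const strategyB = (text) => (text.includes("?") ? "Sure! It's ham." : '{"label": "ham"}');

const scoreA = evaluate(strategyA, testCases);
const scoreB = evaluate(strategyB, testCases);
console.log("A", scoreA);
console.log("B", scoreB);
```

Four cases is obviously not a "significant number"; the point is only the shape. Once the loop exists, swapping prompts, models, or constrained vs. unconstrained decoding becomes a measured comparison instead of a hunch.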
That might not matter to you, but it can be 2-3x slower sometimes.
Would love to know.
tehnub•3w ago
edit: Somehow that link doesn't work... It's the diagram on the "constrained method" page
prats226•3w ago
Every commercial model provider is adding structured outputs so will keep updating the guide.