As to why providers don't give you a nice API for this: maybe it's hard to implement efficiently.
It's not too bad if inference happens token by token, reverting to the CPU every time, but high-performance LLM inference uses speculative decoding: a smaller model guesses multiple tokens in advance and the main model verifies them in one pass. Enforcing grammar constraints across multiple speculated tokens is tougher; there's an exponential number of states that need precomputing.
So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without stalling the pipeline by going back to the CPU.
And then you start thinking about how big that automaton is going to be: how many states, how deep the pushdown stack. You're basically taking code from the API call and running it on your hardware. There are dragons here, around fair use, denial of service, etc.
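A toy sketch of the on-device idea, assuming the grammar compiles to a plain DFA small enough for a dense transition table (real grammars like JSON need a pushdown stack, so this understates the problem; all names and sizes here are made up):

    import torch

    # Hypothetical sizes; real vocabularies run ~32k-128k tokens.
    NUM_STATES, VOCAB, DEAD = 1024, 32000, 0
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Dense DFA transition table kept on the accelerator; row DEAD is an
    # absorbing reject state. In practice this would be filled from the
    # compiled grammar rather than left as zeros.
    T = torch.zeros(NUM_STATES, VOCAB, dtype=torch.long, device=device)

    def verify_draft(state: int, draft: torch.Tensor) -> int:
        """Walk the DFA over speculated tokens entirely on-device; return
        the length of the grammatically valid prefix of the draft."""
        s = torch.tensor(state, device=draft.device)
        states = []
        for tok in draft:
            s = T[s, tok]
            states.append(s)
        alive = (torch.stack(states) != DEAD).long()
        return int(alive.cumprod(0).sum().item())  # one host sync at the end

The table alone is states x vocab entries, which is where the "how big is this automaton" worry starts to bite.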
The AI companies believe these kinds of grammar mistakes will be solved by improving the models. Building out tooling for grammar-constrained inference like this is to suggest, on some level, that GPT-N+1 won't magically solve the problem.
The deeper level is that it's not just simple grammar constraints. Constraining to JSON is a nice party trick, but it opens the door to further ideas. How about constraining to a programming language's grammar? Those are well defined; you just swap the JSON grammar file for the Java grammar file, job done.
We can go further: why not use a language server to constrain not only the grammar but also the content? Which variables and functions are in scope is known, so constraining a variable reference or function call to one of their names can be done with the same technique as grammar constraints. ("Monitor-guided decoding", figured out back in 2023.)
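A hedged sketch of that idea (the tokenizer interface and the in-scope list are assumptions for illustration, not the paper's actual API; real monitor-guided decoding also tracks multi-token identifiers across decoding steps):

    import torch

    def identifier_mask(tokenizer, in_scope: list[str], vocab_size: int) -> torch.Tensor:
        """At a position where an identifier must appear, allow only tokens
        that could begin one of the known in-scope names."""
        mask = torch.zeros(vocab_size, dtype=torch.bool)
        for tok_id in range(vocab_size):
            piece = tokenizer.decode([tok_id])
            if piece and any(name.startswith(piece) for name in in_scope):
                mask[tok_id] = True
        return mask

    # e.g. in_scope = ["user_id", "update_user", "urllib"]
    # logits[~identifier_mask(tokenizer, in_scope, logits.shape[-1])] = float("-inf")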
Entire classes of hallucination problems can be eliminated this way. The marketing writes itself: "Our AI is literally incapable of making the errors humans make!"
What many AI developers, firms, and especially their leaders find grating about this is the implication: that AI is fallible and has to be constrained.
Another such inconvenience is that while these techniques fix grammar, they highlight semantic problems: the code is syntactically correct and compiles, it just does the wrong thing.
https://fireworks.ai/docs/structured-responses/structured-ou...
Why wouldn't we apply the mask immediately for the first sampling? Is this somehow an optimization? Is masking expensive?
> is masking expensive?
It's not expensive per se; it's a single element-wise multiplication of the output vector.
The real "expense" is that you need to prepare masks for every element of your grammar as they are expensive to recompute as needed; LLM tokens do not cleanly map onto elements of your grammar. (Consider JSON: LLM tokens often combine various special characters such as curly braces, colons, and quotes.)
This isn't that hard to compute, it's just more work to implement.
Other libraries work by essentially pre-computing all the masks for all possible generations, but of course you're restricted to working with simple grammars in this case (like a subset of regular expressions)
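A sketch of that precomputation, assuming the grammar is a regex compiled to a DFA with a made-up dfa.step(state, byte) interface; walking each vocabulary entry byte by byte is exactly what handles tokens that glue several grammar elements together:

    def precompute_masks(dfa, vocab: list[bytes]) -> dict[int, list[bool]]:
        """For every DFA state, mark which whole tokens keep the automaton
        alive. dfa.step returns the next state, or None on reject."""
        masks = {}
        for state in dfa.states:
            allowed = []
            for token in vocab:
                s = state
                for b in token:          # walk the token byte by byte
                    s = dfa.step(s, b)
                    if s is None:
                        break
                allowed.append(s is not None)
            masks[state] = allowed
        return masks

The cost is states x vocab x token length, which is why this precompute-everything approach stays practical only for simple grammars.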
However, why not use a grammar that has no invalid sentences, and from there convert to whatever grammar you want?
With a second pass you basically "condition" it on the text right above, hoping for better semantic understanding.
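Roughly, as a sketch (generate and generate_constrained are hypothetical helpers, not a real API):

    draft = generate(model, prompt)        # pass 1: the permissive grammar
    result = generate_constrained(         # pass 2: the target grammar,
        model,                             # conditioned on the draft above
        prompt + "\n\nDraft to convert:\n" + draft,
        grammar="target-grammar.ebnf",
    )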
One statement that surprised me was the author's belief that "models over time will just be able to output JSON perfectly without the need for constraining over time."
I'm not sure how this conclusion was reached. "Perfectly" is a bar that probabilistic sampling cannot meet.
User friendly library that connects to lots of OSS model serving backends: https://github.com/guidance-ai/guidance/
Core Rust library written for high performance mask computation (written mostly by my collaborator @mmoskal): http://github.com/guidance-ai/llguidance
TL;DR: instead of just getting a token and checking whether the parser would accept it, you can zero out the probabilities of all invalid tokens, and do the computation for this in parallel at effectively zero cost:
> Here, compute_mask() can run on the CPU during the time it would be normally just waiting for the GPU to finish. The line prob[~mask] = 0.0 would normally be fused into the softmax kernel in the last stage of the LLM, with negligible overhead. Therefore, as long as the compute_mask() function completes faster than the LLM forward pass and parser.consume() is negligible (typically follows from compute_mask() speed), the constrained generation will be as fast as the unconstrained one.
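Spelled out as a sketch (forward_async and is_done are schematic stand-ins; compute_mask and consume are the names from the quoted pseudocode, not necessarily llguidance's exact Python surface):

    import torch

    def constrained_generate(model, parser, tokens: list[int], max_new_tokens: int):
        for _ in range(max_new_tokens):
            future = model.forward_async(tokens)   # GPU starts the forward pass
            mask = parser.compute_mask()           # CPU builds the mask meanwhile
            logits = future.wait()                 # join before sampling
            probs = torch.softmax(logits[-1], dim=-1)
            probs[~mask] = 0.0                     # zero out grammar-invalid tokens
            next_tok = int(torch.multinomial(probs, 1))
            parser.consume(next_tok)               # advance the parser state
            tokens.append(next_tok)
            if parser.is_done():                   # grammar reached an accept state
                break
        return tokens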
I'm curious: has there been any research or conversation about pushing masking even earlier in the pipeline? In theory, a fair amount of compute goes into computing the probabilities of tokens that will end up masked away anyway.
In general I find that matching the most natural format for a document outperforms waiting for the big model trainers to convince the model that the format you want is a valid structure, so anything that lets me interweave structured and unstructured generation is very interesting to me right now.
The Gemini API has a canonical implementation of structured outputs where you can instead pass the JSON schema as a separate parameter to control the grammar more closely. However, this setting reorders the schema's fields alphabetically beforehand, which is undesirable: the order of fields in a JSON schema is often deliberate, precisely to steer generation.
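A concrete case where that hurts: a schema that deliberately puts a rationale field before the answer, so the answer tokens are generated conditioned on the reasoning. Alphabetical reordering would emit "answer" first and throw that away.

    # Deliberate field order: "reasoning" must be emitted before "answer".
    schema = {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "answer": {"type": "string"},
        },
        "required": ["reasoning", "answer"],
    }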