Yes, it's a thing that works.
There are an enormous number of use cases where the prompt is large and the expected output is small.
E.g. providing data for the LLM to analyze, after which it gives a simple yes/no Boolean response, or selects a single enum value from a set.
This pattern seems far more valuable in practice than the common, lazy open-ended chat-style implementations (lazy from a product perspective).
Obviously decode will be important for code generation or search, but that's such a small set of possible applications, and you'll probably always do better being on the latest models in the cloud.
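To put rough numbers on the "large prompt, small output" case, here's a back-of-the-envelope sketch. The throughput figures are made up for illustration (not measurements of any real device), and `latency_split` is just a hypothetical helper name:

```python
def latency_split(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate time spent in prefill vs. decode for one request.

    prefill_tps / decode_tps are tokens per second for each phase.
    Returns (prefill_seconds, decode_seconds, prefill_share_of_total).
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s / (prefill_s + decode_s)


# Hypothetical workload: 20,000-token document in, a 1-token yes/no answer out.
# Hypothetical hardware: 1,000 tok/s prefill, 50 tok/s decode.
prefill_s, decode_s, share = latency_split(20_000, 1, 1_000.0, 50.0)
print(f"prefill {prefill_s:.2f}s, decode {decode_s:.2f}s, prefill share {share:.1%}")
```

Under those assumed numbers nearly all of the wait is prefill, which is why faster decode barely matters for the classify/yes-no style of workload.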
Now I'm trying to stop myself from finding an excuse to spend upwards of $30k on compute hardware...
Reading the article, I wished for a device that just does both things well. On that topic, it might be noteworthy that Apple's just-released M5 has approximately 3.5x better TTFT performance than the M4, according to their claims!
pram•4h ago