It seems OAI was forced by investors to shift quickly to making money. Anthropic seem to have more time? Might be hard for OAI to keep the pace while focusing on cost
Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a drafting model).
But my money is on the exact two mechanisms the OP proposes.
It is worth noting that consumers are completely and totally incapable of detecting quality degradation with any accuracy. Which is a given since the models are already effectively random, but there is a strong bent to hallucinate degradations. Having done frontend work for an AI startup, complaints of degrading the model were by far the most common, despite the fact that not only did our model not change, users could easily verify that it didn't change because we expose seeds. A significant portion of complainers continue to complain about model degradation even when shown they could regenerate from the same seed+input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
The real reason which batching increases latency is multi-factored and more complex to explain.
When an author is confused about something so elementary, I can’t trust anything else they write.
Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold where stuffing more requests in a batch will slow down every request in isolation even though it may still increase the number of tokens/second across the whole batch for all request in aggregate.
My personal take is that they will need a big model to plan and break down tasks and schedule them to specialized smaller models while there is a good enough model for real time interactions with the user, but it is the naive take and many other things might be shaping the decisions.
Seems like nonsense to me.
criemen•57m ago