Current approaches require fancy tricks to fit all the tokens into memory, and they spread attention thinner over larger numbers of tokens. The new approach tries to keep everything in a single shared memory and process the tokens in parallel across multiple GPUs.
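To make that concrete (rough NumPy sketch, chunk count and shapes made up, not from the article): the reason one context can be spread across several GPUs is that partial attention over KV chunks can be merged exactly with an online softmax, so each device only needs its own slice of the keys/values:

    import numpy as np

    def full_attention(q, k, v):
        """Reference: ordinary softmax attention for a single query."""
        s = k @ q / np.sqrt(q.shape[-1])
        w = np.exp(s - s.max())
        return (w / w.sum()) @ v

    def chunked_attention(q, k, v, n_chunks=4):
        """Same result, but K/V are processed in chunks (as if each chunk
        lived on a different device) and merged with an online softmax."""
        d = q.shape[-1]
        acc = np.zeros_like(v[0])      # running weighted sum of values
        m, l = -np.inf, 0.0            # running max and softmax denominator
        for kc, vc in zip(np.array_split(k, n_chunks), np.array_split(v, n_chunks)):
            s = kc @ q / np.sqrt(d)
            m_new = max(m, s.max())
            scale = np.exp(m - m_new)  # rescale old partial results to the new max
            acc, l = acc * scale, l * scale
            p = np.exp(s - m_new)
            acc += p @ vc
            l += p.sum()
            m = m_new
        return acc / l

    q = np.random.randn(64)
    k, v = np.random.randn(1000, 64), np.random.randn(1000, 64)
    print(np.allclose(full_attention(q, k, v), chunked_attention(q, k, v)))  # True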
Models aren't trained across their context; their context is their short-term memory at runtime, right? Nothing to do with training. They are trained on a static dataset.
I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
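For anyone curious, here's a toy sketch of RoPE (the dimension and base are just common defaults, not from any particular model): positions are encoded by rotating query/key feature pairs, so the attention score only depends on the relative offset between two tokens, and the trouble starts when offsets get larger than anything seen during training.

    import numpy as np

    def rope(x, pos, base=10000.0):
        """Rotate consecutive feature pairs of x by angles that grow with pos."""
        d = x.shape[-1]
        freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
        cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
        out = np.empty_like(x)
        out[0::2] = x[0::2] * cos - x[1::2] * sin
        out[1::2] = x[0::2] * sin + x[1::2] * cos
        return out

    q, k = np.random.randn(64), np.random.randn(64)
    # Same relative offset (5 tokens) -> same score, regardless of absolute position:
    print(rope(q, 10) @ rope(k, 5), rope(q, 100_010) @ rope(k, 100_005))
    # Offsets far beyond the trained context are where extrapolation gets shaky:
    print(rope(q, 1_000_000) @ rope(k, 0))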
Or to put it differently: the LLM is trained on static data, but it's also trained on the capability of handling context itself.
Kimi introduced this: https://github.com/MoonshotAI/Attention-Residuals but I'm pretty sure closed labs like Google have had something like this for a while.
I know Yann LeCun is trying to do a completely different architecture and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
People, researchers, investors, etc. probably also want to see what would be possible, and someone has to do it.
I can also imagine that an inference-optimized system like this could split the context across different requests if a single request doesn't need the full window.
Could also be that they have internal use cases which require this amount of context.
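Something like a shared KV-cache pool that gets carved up per request instead of reserving the full window for everyone; the numbers and names below are invented, just to illustrate the idea:

    from dataclasses import dataclass, field

    @dataclass
    class KVCachePool:
        total_tokens: int                 # the full shared-memory context budget
        allocations: dict = field(default_factory=dict)

        def free_tokens(self) -> int:
            return self.total_tokens - sum(self.allocations.values())

        def admit(self, request_id: str, needed_tokens: int) -> bool:
            """Admit a request only if its context fits in what's left of the pool."""
            if needed_tokens > self.free_tokens():
                return False              # would have to queue or evict instead
            self.allocations[request_id] = needed_tokens
            return True

        def release(self, request_id: str) -> None:
            self.allocations.pop(request_id, None)

    pool = KVCachePool(total_tokens=1_000_000)
    pool.admit("chat-1", 8_000)           # short chat
    pool.admit("code-review-7", 400_000)  # huge repo dump
    print(pool.free_tokens())             # 592000 left for other requests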
Weryj•56m ago
But maybe that’s enough tokens to feed an entire lifetime of user behaviour in for the digital twin dystopia?
AntiUSAbah•25m ago
Nvidia uses ML for fine-tuning and architecting their chips. This might be one use case.
Another one would be to put EVERYTHING from your company into this context window. It would be easier to create 'THE' model for every company or person. It might also be safer than training a model on your data, because you don't end up with a model containing all your data, only memory.
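Rough sketch of that last idea (the paths and the generate call are placeholders, not a real API): the company data only ever lives in the prompt/KV cache for a request, nothing gets baked into the weights:

    from pathlib import Path

    def build_company_context(doc_dir: str, budget_chars: int) -> str:
        """Concatenate company documents into one big prompt, up to a budget."""
        parts, used = [], 0
        for path in sorted(Path(doc_dir).rglob("*.txt")):
            text = path.read_text(errors="ignore")
            if used + len(text) > budget_chars:
                break
            parts.append(f"### {path.name}\n{text}")
            used += len(text)
        return "\n\n".join(parts)

    # The data only exists in the context for this request; unlike fine-tuning,
    # nothing about it ends up in the model's weights.
    context = build_company_context("company_docs/", budget_chars=4_000_000)
    prompt = context + "\n\nQuestion: what does our refund policy say?"
    # answer = model.generate(prompt)   # hypothetical call, depends on your stack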