This line from the abstract got me really suspicious. Obviously a compression scheme that incorporates the entire sequence shouldn't do worse than a per-element one as the length increases.
It is important to note that this paper is PURELY theoretical. I couldn't find much meat on the bone from a quick skim.
The single author, Gregory Magarshak, has published only one paper on arXiv before and appears to be a professor of business / music. I don't plan to give it a closer read in the hope of finding something of value.
The author is not an ML researcher but rather an AI startup CTO / founder. Previously worked on “social operating systems” for the web, blockchain of course. And now an AI innovator. I’m suspicious. This was part of the author’s reply in another thread:
> When TurboQuant came out, I realized we can also go way below the Shannon limit in the same way, and take advantage of PLT. In fact, I'm working on publishing a paper that generalizes this to robotics (which needs to do cheap fast on-board inference "in the field"). I also believe this is how animals actually learn. In other words, over time they learn overall "sequences" of actions and then can check whether they are "good enough" to solve the problem, or whether to switch to a full analysis -- this corresponds to System 1 and 2 of Daniel Kahneman's "Thinking Fast and Slow".
Which doesn’t exactly inspire confidence and makes me wonder who they think their audience is. ML researchers or LinkedIn.
Treating trained weights as random for the purpose of a proof is immediately discrediting for a paper to me.
However, I do have a long history of diving deep into fields and building practical, open-source solutions to major problems I perceive in the fields.
15 years ago I started with social networks and PHP: https://github.com/Qbix http://laweekly.com/restoring-healthy-communities/
8 years ago I got into smart contracts on EVM, which was the SOTA at the time: https://github.com/Intercoin https://intercoin.org/applications
About a year and a half ago I started teaching a course on AI at a university not far from NYU where I studied... and that's what got me into this: https://vimeo.com/1063008765/c7ef3abcc5
I try to document everything on GitHub and in popular articles, but only recently started publishing academic papers on arXiv, and I plan to start submitting them to real publications. While building, I realized that I should start publishing any novel theoretical results that underpin my work.
I plan to publish actual code in a few weeks. To be fair, TurboQuant is also a purely theoretical paper. I just wanted to get this out and share.
TurboQuant is not a purely theoretical paper. Section 4 "Experiments" (page 15) [0] has a bunch of figures based on actual GPU computations.
Contrast that with your paper: no experiments, no implementation, no empirical validation of any kind.
Did you try engaging with LLM researchers and get their feedback on your paper?
I don't understand this. The key and value vectors for any given layer + token are created by the model. By definition, they are exactly equal to the model's prediction of them!
Extreme KV cache compression is easy to get---you can get an infinite compression ratio by just regenerating the key and value vectors on every forward pass. The point of a KV cache is to reduce the amount of repeated computation during generation, though. Compression only helps if you have an efficient decompression algorithm.
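A toy sketch of that tradeoff (my own illustration, not from the paper; a single attention head with no masking, and random matrices standing in for trained projections). Caching and full recomputation produce identical outputs; they differ only in where the compute and memory go:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                            # head dimension (toy size)
Wk = rng.normal(size=(d, d))     # stand-in key projection
Wv = rng.normal(size=(d, d))     # stand-in value projection
xs = rng.normal(size=(5, d))     # hidden states, one per generated token

def attend(q, K, V):
    """One query attending over all keys/values (single head, no mask)."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# With a KV cache: each token's K/V is projected once and kept around.
K_cache = np.stack([Wk @ x for x in xs])
V_cache = np.stack([Wv @ x for x in xs])
out_cached = attend(xs[-1], K_cache, V_cache)

# "Infinite compression": store nothing and re-project every key/value
# on each step -- same output, but O(t) redundant matmuls per token.
out_recomputed = attend(xs[-1],
                        np.stack([Wk @ x for x in xs]),
                        np.stack([Wv @ x for x in xs]))

assert np.allclose(out_cached, out_recomputed)
```

That equivalence is exactly why "compression ratio" alone is meaningless here: the cache exists to avoid the recomputation, so a scheme only wins if decompression is cheaper than re-running the projections.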
Attention really is all you need.
EGreg•1h ago
This is based on my original "PLT" paper: Probabilistic Language Tries (https://news.ycombinator.com/item?id=47743585). A "Trie" is basically a tree of prefixes. While working on https://safebots.ai I became obsessed with caching generated artifacts as a means to do a lot of things: extremely cheap inference, near-optimal compression, modeling decision trees for strategies, and so on.
The PLT model was about compression in general. My main insight there was that the LLM's own weights actually contain an incredibly detailed probability distribution of "the next token" in any sequence, which can therefore be very useful to supercharge statistical compression. Sequences which occur frequently in the domain of the model receive short codes. The other insight is that if we allowed lossy compression, we could compress well below the Shannon information limit, and just have an "overflow" bag for surprising sequences.
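As a rough illustration of the counting-trie idea (my own sketch, not the paper's implementation; the class names and the overflow behavior are assumptions), frequent sequences get high conditional probabilities along their path and therefore short ideal code lengths, while unseen sequences fall through to the overflow bag:

```python
import math

class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0

class PrefixTrie:
    """Counting trie over token sequences: frequent prefixes get high
    probability and hence short (near-entropy) ideal code lengths."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        node.count += 1
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
            node.count += 1

    def code_length_bits(self, tokens):
        """Ideal code length: sum of -log2 P(token | prefix) on the path."""
        node, bits = self.root, 0.0
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                return None          # unseen sequence -> "overflow bag"
            bits += -math.log2(child.count / node.count)
            node = child
        return bits

trie = PrefixTrie()
for seq in [("the", "cat"), ("the", "cat"), ("the", "dog"), ("a", "cat")]:
    trie.insert(seq)

# The frequent sequence gets a shorter ideal code than the rare one.
assert trie.code_length_bits(("the", "cat")) < trie.code_length_bits(("the", "dog"))
# A never-seen sequence falls into the lossy "overflow" path.
assert trie.code_length_bits(("a", "dog")) is None
```

In the actual scheme the conditional probabilities would come from the LLM itself rather than raw counts, but the trie structure and the overflow escape hatch work the same way.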
When TurboQuant came out, I realized we can also go way below the Shannon limit in the same way, and take advantage of PLT. In fact, I'm working on publishing a paper that generalizes this to robotics (which needs to do cheap fast on-board inference "in the field"). I also believe this is how animals actually learn. In other words, over time they learn overall "sequences" of actions and then can check whether they are "good enough" to solve the problem, or whether to switch to a full analysis -- this corresponds to System 1 and 2 of Daniel Kahneman's "Thinking Fast and Slow".
If you want more specific information, or to see the code for a working prototype, you can write to me at the email in the paper.
cristoperb•51m ago
But it does confuse me sometimes, because in LaTeX (and some other markup languages) two hyphens (--) get converted to an en dash, whereas it takes three (---) to make an em dash.
rhet0rica•38m ago
you have 5 seconds to comply before your planet will be demolished to make room for a giant space-typographer's punctuation case
himata4113•40m ago
I was incredibly curious, since I had a pet theory in my mind about something extremely similar, but concluded that the time complexity of such a cache would make it extremely slow.
This is like saying that you've achieved single-token compression when you pass a single token into a model and let it regenerate the entire output, since at the end of the day models are probabilistic stateless devices. At that point you don't have a cache: either you're just replaying the tokens, or you have a caching algorithm whose complexity is similar to the model's, which defeats the purpose of the cache.
I've never considered that arXiv had a problem, now I do.
EGreg•32m ago
On complexity, that's a fair concern, and the paper doesn't fully resolve it. But the analogy to "replaying tokens through the model" isn't exactly right. The delta coding layer uses the model's own next-token prediction, which is already happening during normal autoregressive inference. You're not adding a forward pass; you're using the one already running and storing only the residual, which is much smaller than the raw vector -- precisely because the model is a good predictor of its own next state.
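A minimal sketch of that residual idea (my own illustration, not code from the paper; `predicted` is a stand-in for the model's prediction of the next vector, and `scale` is an assumed quantization step). Because the prediction is close to the true vector, the residual has a much smaller dynamic range and can be stored coarsely:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
scale = 0.05                     # residual quantization step (assumption)

prev_kv = rng.normal(size=d)
predicted = 0.9 * prev_kv        # stand-in for the model's own prediction
actual = predicted + 0.1 * rng.normal(size=d)   # true vector: near the prediction

# Delta coding: store only the quantized residual, not the raw vector.
residual = actual - predicted
q = np.round(residual / scale).astype(np.int8)  # small ints instead of floats
reconstructed = predicted + q.astype(np.float64) * scale

# Reconstruction error is bounded by half a quantization step per coord.
assert np.max(np.abs(reconstructed - actual)) <= scale / 2 + 1e-12
# The residual's spread is far smaller than the raw vector's, which is
# what makes it cheap to store.
assert np.std(residual) < np.std(actual)
```

The better the predictor, the tighter the residual distribution, and the fewer bits each cached vector costs.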
The trie index lookup is O(sequence length), not O(model forward pass). Whether that's fast enough in practice at scale is actually a legitimate open question and I'd be the first to admit the paper doesn't settle it. But the contribution here is simply establishing that the bound exists and is dramatically lower than what the field has been targeting. That's what I wanted to put out. The engineering question of how close you can get is the natural next step.
Your pet theory about time complexity sounds interesting actually, did you write it up anywhere?
EGreg•26m ago
And yes, it's a compute/memory tradeoff, all caching is. The claim is just that the memory floor is much lower than anyone had formally established. Whether the compute cost of getting there is worth it is a fair open question the paper doesn't settle. But what if it is? Caching is the thread running through most of my work, and I intend to find out.