Something like 1.61B just doesn't mean much to me since I don't know much about the guts of LLMs. But I'm curious about how that translates to computer hardware -- what specs would I need to run these? What could I run now, what would require spending some money, and what I might hope to be able to run in a decade?
In practice, models can be quantized to smaller weights for inference. Usually, the performance loss going from 16 bit weights to 8 bit weights is very minor, so a 1 billion parameter model can take 1 gigabyte. Thinking about these models in terms of 8-bit quantized weights has the added benefit of making the math really easy. A 20B model needs 20G of memory. Simple.
Of course, models can be quantized down even further, at greater cost of inference quality. Depending on what you're doing, 5-bit weights or even lower might be perfectly acceptable. There's some indication that models that have been trained on lower bit weights might perform better than larger models that have been quantized down. For example, a model that was trained using 4-bit weights might perform better than a model that was trained at 16 bits, then quantized down to 4 bits.
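To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. It only counts the weights themselves; real inference also needs room for the KV cache and activations, which this ignores. The function name is made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# The 20B example from above at a few common quantization levels.
for bits in (16, 8, 5, 4):
    print(f"20B model at {bits}-bit: ~{weight_memory_gb(20, bits):.1f} GB")
```

At 8 bits the parameter count in billions and the size in gigabytes are the same number, which is why that case is so convenient to reason about.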
When running models, a lot of the performance bottleneck is memory bandwidth. This is why LLM enthusiasts are looking for GPUs with the most possible VRAM. Your computer might have 128G of RAM, but your GPU's access to that memory is so constrained by bandwidth that you might as well run the model on your CPU. Running a model on the CPU can be done, it's just much slower because the computation is so parallel.
Today's higher end consumer grade GPUs mostly have up to 24G of dedicated VRAM (an Nvidia RTX 5090 has 32G of VRAM and they're like $2k). The dedicated VRAM on a GPU has a memory bandwidth of about 1 TB/s. Apple's M-series of ARM-based CPUs have 512 GB/s of bandwidth, and they're one of the most popular ways of being able to run larger LLMs on consumer hardware. AMD's new "Strix Halo" CPU+GPU chips have up to 128G of unified memory, with a memory bandwidth of about 256 GB/s.
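A rough way to see why bandwidth matters so much: if you assume a dense model generating one token at a time, every weight has to be read from memory once per token, so bandwidth divided by model size gives a ceiling on tokens per second. That is a simplifying assumption on my part (it ignores batching, caching, MoE sparsity and compute limits), and I'm reading the bandwidth figures above as roughly 1000, 512 and 256 gigabytes per second.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if all weights are read once per token."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 20  # e.g. a 20B-parameter model with 8-bit weights

# Bandwidth figures taken from the comment above, read as GB/s.
for name, bandwidth in [
    ("dedicated GPU VRAM", 1000),
    ("Apple M-series", 512),
    ("AMD Strix Halo", 256),
]:
    ceiling = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

Real numbers will come in under these ceilings, but the ordering between platforms tends to follow the bandwidth.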
Reddit's r/LocalLLaMA is a reasonable place to look to see what people are doing with consumer grade hardware. Of course, some of what they're doing is bonkers so don't take everything you see there as a guide.
And as far as a decade from now, who knows. Currently, the top silicon fabs of TSMC, Samsung, and Intel are all working flat-out to meet the GPU demand from hyperscalers rolling out capacity (Microsoft Azure, AWS, Google, etc). Silicon chip manufacturing has traditionally followed a boom/bust cycle. But with geopolitical tensions, global trade barriers, AI-driven advances, and whatever other black swan events, what the next few years will look like is anyone's guess.
I think in these scenarios, articles should include the prompt and generating model.
There are some signs it was possibly written by a non-native speaker.
Thank you for spotting the error.
All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be on O(100TB). So we're at about 1% of that in model size, and we're in a diminishing-returns area of training -- i.e., going to >1% has not yielded improvements (cf. gpt4.5 vs 4o).
This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents, whereby (mostly) deterministic tools are supplementing information/capability into the system.
I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.
I'd guess targeting 1TB inference-time VRAM would be a reasonable medium-term target for high quality open source models -- that's within the reach of most SMEs today. That's about 250bn params.
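For what it's worth, the 250bn figure works out if you assume 4 bytes (32 bits) per parameter. That assumption is my inference, it isn't stated above. A small sketch of how the count changes with precision:

```python
def params_that_fit_billions(memory_tb: float, bytes_per_param: float) -> float:
    """Billions of parameters whose weights fit in the given memory (1 TB = 1e12 bytes)."""
    return memory_tb * 1e12 / bytes_per_param / 1e9


for label, bytes_per_param in [("32-bit", 4), ("16-bit", 2), ("8-bit", 1)]:
    fit = params_that_fit_billions(1, bytes_per_param)
    print(f"1 TB at {label} weights: ~{fit:.0f}B params")
```

So with 8-bit weights the same 1TB would hold roughly four times as many parameters, before leaving any headroom for context.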
Where are you getting these numbers from? Interested to see how that's calculated.
I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx 50MB. (Might be misquoted as don't have source anymore).
Extract just the plain text from that (+social media, etc.), remove symbols outside of a 64 symbol alphabet (6 bits) and compress. "Feels" to me around a 100TB max for absolutely everything.
Either way, full-fat LLMs are operating at 1-10% of this scale, depending how you want to estimate it.
If you run a more aggressive filter on that 100TB, e.g., for a more semantic dedup, there's a plausible argument for "information" in English texts available being ~10TB -- then we're running close to 20% of that in LLMs.
If we take LLMs to just be that "semantic compression algorithm", and supposing the maximum useful size of an LLM is 2TB, then you could run the argument that everything "salient" ever written is <10TB.
Taking LLMs to be running at close-to 50% "everything useful" rather than 1% would be an explanation of why training has capped out.
I think the issue is at least as much to do with what we're using LLMs for -- i.e., instruction fine-tuning requires some more general (proxy/quasi-) semantic structures in LLMs and I think you only need O(1%) of "everything ever written" to capture these. So it wouldn't really matter how much more we added, instruction-following LLMs don't really need it.
50 MB feels too low, unless the quote meant text up until the 20th century, in which case it feels much more believable. In terms of text production and publishing, we're still riding an exponent, so a couple orders of magnitude increase between 1899 and 2025 is not surprising.
(Talking about S-curves is all the hotness these days, but I feel it's usually a way to avoid understanding what exponential growth means - if one assumes we're past the inflection point, one can wave their hands and pretend the change is linear, and continue to not understand it.)
Any given English translation of the Bible is by itself something like 3-5 megabytes of ASCII; the complete works of Shakespeare are about 5 megabytes; and I think (back of the envelope estimate) you'd get about the same again for what Arthur Conan Doyle wrote before 1900.
I can just about believe there might have been only ten thousand Bible-or-Shakespeare sized books (plus all the court documents, newspapers, etc. that add up to that) worldwide by 1900, but not ten.
Edit: I forgot about encyclopaedias, by 1900 the Encyclopædia Britannica was almost certainly more than 50 MB all by itself.
Or it may just be someone bloviating and just being wrong... I think even ancient texts could exceed that number, though perhaps not by an order of magnitude.
Most people who blog could write 1k words a day. That's a million in 3 years. So not crazy numbers here.
That's 5 MB. Maybe you meant 50 GB. I'd hazard 50 TB.
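Back-of-the-envelope version of that, assuming roughly 5 bytes per English word including the space (an assumption for illustration; measured text usually comes out a bit higher):

```python
def blogging_output(words_per_day: int, years: int, bytes_per_word: float = 5.0):
    """Return (total words, total bytes) for a steady daily writing habit."""
    words = words_per_day * 365 * years
    return words, words * bytes_per_word


words, size_bytes = blogging_output(1_000, 3)
print(f"{words:,} words, ~{size_bytes / 1e6:.1f} MB of plain text")
```

One prolific blogger over three years is only a few megabytes, which gives a feel for how much writing a gigabyte or a terabyte of plain text represents.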
After that, make the robots explore and interact with the world by themselves, to fetch even more data.
In all seriousness, adding image and interaction data will probably be enormously useful, even for generating text.
There are just a lot of avenues to try at this point.
FWIW there is a huge difference between 4.5 and 4o.
Did you mean to type EB?
There's no way the entire Web fits in 400$ worth of hard drives.
Maybe text only, though...
Perhaps the 50TB estimate is unique information without any media or so, but OP can back up where they got that number from better than I can with guesswork.
I tried to estimate how much data this actually is:
# annas archive stats
papers = 105714890
books = 52670695
# word count estimates
avrg_words_per_paper = 10000
avrg_words_per_book = 100000
words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
# quick test on 27 million words from a few books
sample_words = 27809550
sample_bytes = 158824661
sample_bytes_comp = 28839837 # using zpaq -m5
bytes_per_word = sample_bytes/sample_words
byte_comp_ratio = sample_bytes_comp/sample_bytes
word_comp_ratio = bytes_per_word*byte_comp_ratio
print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB
So uncompressed ~30 TB and compressed ~5.5 TB of data. That fits on three 2TB micro SD cards, which you could buy for a total of 750$ from SanDisk.
For example, it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth, falsely claiming that the former is stopping the latter from being released. It also claims 40B of internet text data is 10B tokens, which seems a little odd. Llama 405B was also trained on more than 15 trillion tokens[1], but the post claims only 3.67 trillion for some reason. It also doesn't mention Mistral large for some reason, even though it's the first good European 100B+ dense model.
>The MoE arch. enabled larger models to be trained and used by more people - people without access to thousands of interconnected GPUs
You still need thousands of GPUs to train a MoE model of any actual use. This is true for inference in the sense that it's faster I guess, but even that has caveats because MoE models are less powerful than dense models of the same size, though the trade-off has apparently been worth it in many cases. You also didn't need thousands of GPUs to do inference before, even for the largest models.
The conclusion is all over the place, and has lots of just weird and incorrect implications. The title is about how big LLMs are, why is there such a focus on token training count? Also no mention of quantized size. This is a bad AI slop article (whoops, turns out the author accidentally said it was AI generated, so it's a bad human slop article).
> it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth
I can clarify this part. I wrote 'There was a scandal as facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of llama-4 there and released a different model' which is true.
But it is inside the section about the llama 4 model behemoth. So I see how that could be confusing/misleading.
I could restructure that section a little to improve it.
> Llama 405B was also trained on more than 15 trillion tokens[1],
You're talking about Llama 405B instruct, I'm talking about Llama 405B base. Of course the instruct model has been trained on more tokens.
> why is there such a focus on token training count?
I tried to include the rough training token count for each model I wrote about - plus additional details about training data mixture if available. Training data is an important part of an LLM.
Turns out, size really did matter, at least at the base model level. Only with the release of truly massive dense (405B) or high-activation MoE models (DeepSeek V3, DBRX, etc) did we start seeing GPT-4-level reasoning emerge outside closed labs.
If you like a darker color scheme, here it is:
https://app.charts.quesma.com/s/f07qji
And active vs total:
"The English Wikipedia, when compressed, currently occupies approximately 24 GB of storage space without media files. This compressed size represents the current revisions of all articles, but excludes media files and previous revisions of pages, according to Wikipedia and Quora."
So 3x is correct but LLMs are lossy compression.
When you train it to be an assistant model, it's better at compressing assistant transcripts than it is general text.
There is an eval which I have a lot of interest in and respect for, called UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good of a language model an LLM is by applying it to a range of compression tasks.
This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
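The link between language modelling and compression is just arithmetic on log-probabilities: a token the model assigns probability p costs about -log2(p) bits to encode. Here is a minimal sketch of turning per-token log-probabilities into a bits-per-byte score. The log-probabilities are invented for illustration, and I'm not claiming this is exactly how UncheatableEval computes its numbers.

```python
import math


def compressed_bits(token_logprobs):
    """Ideal code length in bits, given natural-log probabilities per token."""
    return -sum(token_logprobs) / math.log(2)


def bits_per_byte(token_logprobs, text):
    """Code length normalised by the UTF-8 size of the original text."""
    return compressed_bits(token_logprobs) / len(text.encode("utf-8"))


# Hypothetical values: three tokens predicted with p = 1/2, 1/4 and 1/8.
text = "hello world"
logprobs = [math.log(0.5), math.log(0.25), math.log(0.125)]

print(f"{compressed_bits(logprobs):.2f} bits total")
print(f"{bits_per_byte(logprobs, text):.3f} bits per byte")
```

Lower is better, and normalising by bytes rather than tokens keeps models with different tokenizers comparable.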
I think it's a solid description for a raw model, but it's less applicable once you start combining an LLM with better context and tools.
What's interesting to me isn't the stuff the LLM "knows" - it's how well an LLM system can serve me when combined with RAG and tools like web search and access to a compiler.
The most interesting developments right now are models like Gemma 3n which are designed to have as much capability as possible without needing a huge amount of "facts" baked into them.
This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".
A lot of that comes from the sheer amount of data we throw at these models, which provide enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regards to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectations that many language models only need to produce text as fast as it can be read by a human means that we can even further optimize for efficiency over speed.
Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/
To me the amazing thing is that you can tell the model to do something, even follow simple instructions in plain English, like make a list or write some python code to do $x, that's the really amazing part.
So text wikipedia at 24G would easily hit 8G with many standard forms of compression, I'd think. If not better. And it would be 100% accurate, full text and data. Far more usable.
It's so easy for people to not realise how massive 8GB really is, in terms of text. Especially if you use ascii instead of UTF.
They host a pretty decent article here: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
The relevant bit:
> As of 16 October 2024, the size of the current version including all articles compressed is about 24.05 GB without media.
Well, my fallback position is that one is lossy and the other is not.
Then ask for the same list sorted and get that nearly instantly,
These models have a short time context for now, but they already have a huge “working memory” relative to us.
It is very cool. And indicative that vastly smarter models are going to be achieved fairly easily, with new insight.
Our biology has had to ruthlessly work within our biological/ecosystem energy envelope, and with the limited value/effort returned by a pre-internet pre-vast economy.
So biology has never been able to scale. Just get marginally more efficient and effective within tight limits.
Suddenly, (in historical, biological terms), energy availability limits have been removed, and limits on the value of work have compounded and continue to do so. Unsurprising that those changes suddenly unlock easily achieved vast untapped room for cognitive upscaling.
I don't think your second sentence logically follows from the first.
Relative to us, these models:
- Have a much larger working memory.
- Have much more limited logical reasoning skills.
To some extent, these models are able to use their superior working memories to compensate for their limited reasoning abilities. This can make them very useful tools! But there may well be a ceiling to how far that can go.
When you ask a model to "think about the problem step by step" to improve its reasoning, you are basically just giving it more opportunities to draw on its huge memory bank and try to put things together. But humans are able to reason with orders of magnitude less training data. And by the way, we are out of new training data to give the models.
Relative to the best humans, perhaps, but I seriously doubt this is true in general. Most people I work with couldn’t reason nearly as well through the questions I use LLMs to answer.
It’s also worth keeping in mind that having a different approach to reasoning is not necessarily equivalent to a worse approach. Watch out for cherry-picking the cons of its approach and ignoring the pros.
For some reason, the bar for AI is always against the best possible human, right now.
Common belief, but false. You start learning from inside the womb. The data flow increases exponentially when you open your eyes and then again when you start manipulating things with your hands and mouth.
> When you ask a model to "think about the problem step by step" to improve its reasoning, you are basically just giving it more opportunities to draw on its huge memory bank and try to put things together.
We do the same with children. At least I did it to my classmates when they asked me for help. I'd give them a hint, and ask them to work it out step by step from there. It helped.
But you don't get data equal to the entire internet as a child!
> We do the same with children. At least I did it to my classmates when they asked me for help. I'd give them a hint, and ask them to work it out step by step from there. It helped.
And I do it with my students. I still think there's a difference in kind between when I listen to my students (or other adults) reason through a problem, and when I look at the output of an AI's reasoning, but I admittedly couldn't tell you what that is, so point taken. I still think the AI is relying far more heavily on its knowledge base.
Given vision and the other senses, I’d argue that your average toddler has probably trained on more sensory information than the largest LLMs ever built long before they learn to talk.
Then there's the whole slew of processes that pick up two or three key points of data and then fill in the rest (EX the moonwalking bear experiment [0]).
I guess all I'm saying is that raw input isn't the only piece of the puzzle. Maybe it is at the start before a kiddo _knows_ how to focus and filter info?
Only easily accessible text data. We haven't really started using video at scale yet for example. It looks like data for specific tasks goes really far too ... for example agentic coding interactions aren't something that has generally been captured on the internet. But capturing interactions with coding agents, in combination with the base-training of existing programming knowledge already captured is resulting in significant performance increases. The amount of specialized data we might need to gather or synthetically generate is perhaps orders of magnitude less than presumed with pure supervised learning systems. And for other applications like industrial automation or robotics we've barely started capturing all the sensor data that lives in those systems.
But in evolutionary time frames, clearly those limits are lifting extraordinarily quickly. By many orders of magnitude.
And the point I made, that our limits were imposed by harsh biological energy and reward limits, vs. today's models (and their successors) which have access to relatively unlimited energy, and via sharing value with unlimited customers, unlimited rewards, stands.
It is a much simpler problem to improve digital cognition in a global ecosystem of energy production, instant communication and global application, than it was for evolution to improve an individual animal's cognition in the limited resources of local habitats and their inefficient communication of advances.
Although strictly speaking they have lots of information in a small package, they are F-tier compression algorithms because the loss is bad, unpredictable, and undetectable (i.e. a human has to check it). You would almost never use a transformer in place of any other compression algorithm for typical data compression uses.
...and we still can't. If your lawyer sent you your case files in the form of an LLM trained on those files, would you be comfortable with that? Where is the situation you would compress text with an LLM over a standard compression algo? (Other than to make an LLM).
Other lossy compression targets known superfluous information. MP3 removes sounds we can't really hear, and JPEG works by grouping uniform color pixels into single chunks of color.
LLMs kind of do their own thing, and the data you get back out of them is correct, incorrect, or dangerously incorrect (i.e. is plausible enough to be taken as correct), with no algorithmic way to discern which is which.
So while yes, they do compress data and you can measure it, the output of this "compression algorithm" puts it in the same family as a "randomly delete words and thesaurus long words into short words" compression algorithm. Which I don't think anyone would consider using to compress their documents.
I'm not arguing that LLMs don't compress data, I am arguing that they are technically compression tools, but not colloquially compression tools, and the overlap they have with colloquial compression tools is almost zero.
Ask ten people and they'll give ten different summaries. Are humans unsuitable too?
I'm not making an argument about whether the compression is good or useful, just like I don't find 144p bitrate starved videos particularly useful. But it doesn't seem so unlike other types of compression to me.
Exactly like information from humans, then?
If the LLM-based compression method was well-understood and demonstrated to be reliable, I wouldn't oppose it on principle. If my lawyer didn't know what they were doing and threw together some ChatGPT document transfer system, of course I wouldn't trust it, but I also wouldn't trust my lawyer if they developed their own DCT-based lossy image compression algorithm.
In one view, you can see LLMs as SOTA lossless compression algorithms, where the number of weights doesn't count towards the description length. Sounds crazy but it's true.
Compressing a comprehensive command line reference via model might introduce errors and drop some options.
But for many people, especially new users, referencing commands, and getting examples, via a model would deliver many times the value.
Lossy vs. lossless are fundamentally different, but so are use cases.
and his last before departing for Meta Superintelligence https://www.youtube.com/live/U-fMsbY-kHY?si=_giVEZEF2NH3lgxI...
The more and faster a “mind” can infer, the less it needs to store.
Think how much fewer facts a symbolic system that can perform calculus needs to store, vs. an algebraic, or just arithmetic system, to cover the same numerical problem solving space. Many orders of magnitude less.
The same goes for higher orders of reasoning. General or specific subject related.
And higher order reasoning vastly increases capabilities extending into new novel problem spaces.
I think model sizes may temporarily drop significantly, after every major architecture or training advance.
In the long run, “A circa 2025 maxed M3 Ultra Mac Studio is all you need!” (/h? /s? Time will tell.)
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C36&q=pre...
It's good enough that it has changed my mind about the fundamental utility of LLMs for coding in non-Javascript complexity regimes.
But it's still not an expert programmer, not by a million miles, there is no way I could delegate my job to it (and keep my job). So there's some interesting boundary that's different than I used to think.
I think it's in the vicinity of "how much precedent exists for this thought or idea or approach". The things I bring to the table in that setting have precedent too, but much more tenuously connected to like one clear precedent on e.g. GitHub, because if the thing I need was on GitHub I would download it.
It's the first derivative.
One factor is the huge redundancies pervasive in our communication.
(1) There are so many ways to say the same thing, that (2) we have to add even more words to be precise at all. Without a verbal indexing system we (3) spend many words just setting up context for what we really want to say. And finally, (4) we pervasively add a great deal of intentionally non-informational creative and novel variability, and mood inducing color, which all require even more redundancy to maintain reliable interpretation, in order to induce our minds to maintain attention.
Our minds are active resistors of plain information!
All four factors add so much redundancy, it's probably fair to say most of our communication (by bits, characters, words, etc., maybe 95%? 98%? or more!) is pure redundancy.
Another helpful compressor is that many facts are among a few "reasonably expected" alternative answers. So it takes just a little biasing information to encode the right option.
Finally, the way we reason seems to be highly common across everything that matters to us. Even though we have yet to identify and characterize this informal human logic. So once that is modeled, that itself must compress a lot of relations significantly.
Fuzzy Logic was a first approximation attempt at modeling human “logic”. But has not been very successful.
Models should eventually help us uncover that “human logic”, by analyzing how they model it. Doing so may let us create even more efficient architectures. Perhaps significantly more efficient, and even provide more direct non-gradient/data based “thinking” design.
Nevertheless, the level of compression is astounding!
We are far less complicated cognitive machines than we imagine! Scary, but inspiring too.
I personally believe that common PCs of today, maybe even high end smart phones circa 2025, will be large enough to run future super intelligence when we get it right, given internet access to look up information.
We have just begun to compress artificial minds.
> The English Wikipedia, as of June 26, 2025, contains over 7 million articles and 63 million pages. The text content alone is approximately 156 GB, according to Wikipedia's statistics page. When including all revisions, the total size of the database is roughly 26 terabytes (26,455 GB)
https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
How close does it come?
The vast, vast majority of LLM knowledge is not found in Wikipedia. It is definitely not its only job.
Lots of various sources that you can download locally to have available offline. They're even providing some pre-loaded devices in areas where there may not be reliable or any internet access.
It is 64,800,000,000 bits.
I can imagine 100 bits sure. And 1,000 bits why not. 10,000 you lose me. A million? That sounds like a lot. Now 64 million would be a number I can't well imagine. And this is a thousand times 64 million!
the self-execution is the interactive chat interface.
wikipedia gets "trained" (compiled+compressed+lossy) into an executable you can chat with, you can pass this through another pretrained A.I. that can talk out the text or transcribe it.
I think writing compilers is now officially a defunct skill, of historical and conservation interest more than anything else; but I don't like saying "conservation", it's a bad framing. I'd rather say "legacy connectivity", which is a form of continuity or backwards compatibility.
That parenthetical doesn't quite work for me.
If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.
There's a paper that shows that if you very deliberately train a model on its own output in a loop you can get worse performance. That's not what AI labs using synthetic data actually do.
That paper gets a lot of attention because the schadenfreude of models destroying themselves through eating their own tails is irresistible.
This is exactly what I did in a previous role, fine-tuning Llama and Mistral models on a mix of human and GPT-4 data for a domain-specific task. Adding (good) synthetic data definitely increased the output quality for our tasks.
https://gist.github.com/rain-1/cf0419958250d15893d8873682492...
2. "superintelligence"
https://en.m.wikipedia.org/wiki/Superintelligence
"Meta is uniquely positioned to deliver superintelligence to the world."
https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-met...
Is there any difference between 1 and 2
Yes. One is purely hypothetical
There is kind of a vague sense in which this metaphor holds, but there is a much more interesting and rigorous fact about LLMs which is that they are also _lossless_ compression algorithms.
There are at least two senses in which this is true:
1. You can use an LLM to losslessly compress any piece of text at a cost that approaches the log-likelihood of that text under the model, using arithmetic coding. A sender and receiver both need a copy of the LLM weights.
2. You can use an LLM plus SGD (i.e. the training code) as a lossless compression algorithm, where the communication cost is the area under the training curve (and the model weights don't count towards description length!). See: Jack Rae, "Compression for AGI".
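The first sense can be sketched in a few lines. This is a toy, not an LLM: `predict` is a stand-in that uses smoothed character counts over a tiny made-up alphabet, and exact fractions replace the fixed-precision arithmetic a real coder would use. What it does show is the mechanism: sender and receiver run the same predictor, and the code length lands within one bit of the log-likelihood of the text under that predictor.

```python
from fractions import Fraction
from math import ceil, log2

ALPHABET = "abcdefghijklmnopqrstuvwxyz ."  # toy alphabet shared by both sides


def predict(context):
    """Toy stand-in for an LLM: next-character probabilities from smoothed counts."""
    counts = {ch: 1 for ch in ALPHABET}  # add-one smoothing
    for ch in context:
        counts[ch] += 1
    total = sum(counts.values())
    return {ch: Fraction(n, total) for ch, n in counts.items()}


def narrow(text):
    """Interval [low, low + width) that arithmetic coding assigns to text."""
    low, width = Fraction(0), Fraction(1)
    for i, ch in enumerate(text):
        probs = predict(text[:i])
        for sym in ALPHABET:
            if sym == ch:
                break
            low += width * probs[sym]  # skip the slices of earlier symbols
        width *= probs[ch]  # width ends up equal to P(text)
    return low, width


def ideal_bits(text):
    """Negative log2-likelihood of text under the toy predictor."""
    _, width = narrow(text)
    return log2(width.denominator) - log2(width.numerator)


def encode(text):
    """Return (k, nbits): the binary fraction k / 2**nbits lies inside the interval."""
    low, width = narrow(text)
    nbits = 0
    while Fraction(1, 2 ** nbits) > width:  # smallest nbits with 2**-nbits <= width
        nbits += 1
    return ceil(low * 2 ** nbits), nbits


def decode(k, nbits, length):
    """Rebuild the text by replaying the same predictions the sender used."""
    value = Fraction(k, 2 ** nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        probs = predict("".join(out))
        for sym in ALPHABET:
            span = width * probs[sym]
            if value < low + span:  # value falls in this symbol's slice
                out.append(sym)
                width = span
                break
            low += span
    return "".join(out)


message = "the cat sat on the mat."
k, nbits = encode(message)
print(f"{len(message) * 8} raw bits -> {nbits} coded bits "
      f"(log-likelihood: {ideal_bits(message):.1f} bits)")
print(decode(k, nbits, len(message)) == message)
```

The message length is passed to `decode` separately here; a real scheme would transmit it or use an end-of-text symbol. Swap the toy predictor for a better one and the same machinery produces a shorter code, which is the whole point.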
You're right the T5 stuff is very important historically but they're below 11B and I don't have much to say about them. Definitely a very interesting and important set of models though.
Eh?
* Gemma 1 (2024): 2B, 7B
* Gemma 2 (2024): 2B, 9B, 27B
* Gemma 3 (2025): 1B, 4B, 12B, 27B
This is the same range as some Llama models which you do mention.
> important historically
Aren't you trying to give a historical perspective? What's the point of this?
That said, there's an unstated assumption here that these truly large language models are the most interesting thing. The big players have been somewhat quiet but my impression from the outside is that OpenAI let a little bit leak with their behavior. They built an even larger model and it turned out to be disappointing so they quietly discontinued it. The most powerful frontier reasoning models may actually be smaller than the largest publicly available models.