It's hard to map the frequency of knowledge injection to a real-world understanding of how much knowledge a 4B-parameter model can actually hold.
I've seen recent work that claimed 70% of the params are used for memorization.
The capacity definition in this recent paper is completely different: it is based on the Kolmogorov complexity of predicting a memorized sequence, or in layman's terms, how easily known sequences can be compressed. This allows for some bit "errors", i.e. some symbols with a bad compression ratio; only the total compression ratio of the sequence is measured.
This is somewhat parallel to classical ECC limits (strict Hamming-distance constraints) vs modern probabilistic ECC limits.
TLDR: when you allow a small number of errors, the measured capacity increases from roughly 2 bits to roughly 3.6 bits per parameter.
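My rough reading of that definition, as a toy sketch (made-up per-token probabilities, not any real model): memorization is measured as the coding savings you get by compressing the sequence with the model instead of a baseline, so a few badly compressed symbols just reduce the total rather than disqualifying the sequence.

    import math

    def code_length_bits(token_probs):
        """Ideal (Shannon) code length for a sequence given per-token probabilities."""
        return sum(-math.log2(p) for p in token_probs)

    # Hypothetical probabilities for the same 4-token sequence under two compressors.
    baseline = [0.01, 0.02, 0.01, 0.05]  # generic reference model
    target   = [0.90, 0.70, 0.95, 0.10]  # model being measured; one poorly
                                         # compressed token (0.10) is tolerated

    memorized_bits = code_length_bits(baseline) - code_length_bits(target)
    print(f"bits attributed to memorization: {memorized_bits:.1f}")  # ~19 bits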
I still haven't seen strong evidence that fine-tuning to add extra knowledge is effective, but I'd be delighted to learn otherwise.
For example, could there be a site like HN with ten thousand contributors where the contributions are changes to an LLM rather than posts and comments?
One issue is that if contribution A contradicts contribution B, then on HN the contradiction presents no problem (i.e., two HN comments can and often do contradict each other just fine) whereas AFAICT the LLM will need to resolve the contradiction somehow to give coherent answers on the subject matter of the contributions A and B. Then again I suppose the LLM's answer could take the form, "opinions on [subject] vary, with some maintaining that . . . whereas others claim that . . ."
This is sometimes called RAG, for Retrieval Augmented Generation.
These days the most convincing way to do this is via tool calls.
Provide your LLM harness with a tool for running searches, and tell it to use that tool any time it needs additional information.
A good "reasoning" LLM like GPT-5 or Claude 4 can even handle contradictory pieces of information - they can run additional searches if they get back confusing results and work towards a resolution, or present "both sides" to the user if they were unable to figure it out themselves.
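As a rough sketch of what that harness loop looks like (all names here are hypothetical stand-ins, not any particular SDK):

    def web_search(query: str) -> str:
        """Hypothetical search backend; returns a text blob of results."""
        raise NotImplementedError

    def call_llm(messages: list, tools: list) -> dict:
        """Hypothetical model call; returns {'tool': 'search', 'args': {...}} or {'answer': text}."""
        raise NotImplementedError

    TOOLS = [{
        "name": "search",
        "description": "Run a web search whenever you need information you are not sure about.",
        "parameters": {"query": "string"},
    }]

    def answer(question: str, max_steps: int = 5) -> str:
        messages = [
            {"role": "system", "content": "Use the search tool whenever you need more information. "
                                          "If sources conflict, search again or present both sides."},
            {"role": "user", "content": question},
        ]
        for _ in range(max_steps):
            reply = call_llm(messages, TOOLS)
            if "answer" in reply:
                return reply["answer"]
            # The model asked for a search: run it and feed the results back in.
            messages.append({"role": "tool", "name": "search",
                             "content": web_search(reply["args"]["query"])})
        return "Could not resolve within the step budget."

The key point is the loop: the model decides when to search, sees the results as new context, and can keep searching until it's satisfied.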
Let's say, just in time for Jesus to save you.
However, the frontier models keep improving quickly enough that it's often more effective to wait for the general solution to catch up with your task than to spend months training a model yourself, unless you need a particular tightly controlled behavior, a smaller and faster model, or what have you. Training new knowledge in can get weird [2].
And in-context learning takes literal seconds-to-minutes of time if your information fits in the context window, so it's a lot faster to go that route if you can.
The problem is how they inject it. Their “knowledge” isn’t natural language; it’s templated Wikidata triples like "X is the capital of Y." That’s a super low-entropy, highly repetitive distribution. When you cram enough of that into a fixed token budget, you’re not really teaching the model more facts — you’re just destroying linguistic diversity and skewing the token statistics.
In real pretraining or domain adaptation scenarios, "knowledge" tends to appear in richer, more varied contexts. The practical takeaway isn't "don't add too much domain data" but rather "don't overrepresent any single format or narrow syntactic pattern." The issue seems to be more about representational homogeneity than about factual density itself.
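A crude way to see the "low entropy" point (made-up sentences, unigram token entropy as a stand-in for diversity):

    from collections import Counter
    import math

    templated = [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
        "Madrid is the capital of Spain.",
    ]
    varied = [
        "France's government sits in Paris, its capital.",
        "After reunification, Berlin once again became Germany's capital.",
        "Spain is governed from Madrid, the country's capital and largest city.",
    ]

    def unigram_entropy(sentences):
        tokens = [t.lower().strip(".,") for s in sentences for t in s.split()]
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    print(f"templated: {unigram_entropy(templated):.2f} bits/token")  # noticeably lower
    print(f"varied:    {unigram_entropy(varied):.2f} bits/token")

Cram millions of sentences from the first distribution into a fixed token budget and the pressure is all toward reproducing one frame, not toward the facts inside it.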
If the data you present is low-entropy, the model will memorize it. You need to make the task sufficiently complex that memorization stops being the easiest solution.
What colour is a tomato?
What colour is a ruby?
What colour are lips?
What colour is a strawberry?
What colour is blood?
What colour traffic light do you drive on?

> Doesn't this then support the claim that LLMs aren't building world models
There's actually no strong evidence that LLMs, or any AI system, are actually building a world model. These systems are judged to have "world model" capabilities based on benchmarks, but benchmarks will never be able to tell you whether such a feat is taking place. The way people claim these systems have world models is by testing them for consistency; the thing is that a world model is counterfactual. The problem with benchmarks is that they do not distinguish memorization from generalization. To make things worse, the term "Out of Distribution" (OOD) is rather fuzzy and gets abused quite a bit (I can explain more if anyone wants). Basically, you should not trust any claim of "few shot" or "zero shot"; no such claim can be made without deep knowledge of the datasets the models are trained on. It helps to go back to the original zero-shot papers.
One bit that might actually help in understanding things is that a world model does not actually need to make correct predictions, which should expose a critical flaw in benchmarking these capabilities. You can look to the history of physics and gather many great examples of this. For example, the geocentric model still had predictive power, was counterfactual, and was quite accurate. It was in fact a world model, despite being wrong. There was legitimate pushback to Galileo, specifically over tides [0]. If you like that kind of stuff I highly recommend the podcast "An Opinionated History of Mathematics" [1].
There's a lot more complexity and nuance to this, but I'll say that there's a reason we do physics the way we do. Benchmarks and empirical evidence play a critical role in developing physics theories and confirming them, but they are not enough on their own to build our models. (You'll also find that physicists are common dissenters from the claim that LLMs have world models. Sure, you'll also find the Max Tegmark types, but in general the consensus goes the other way, and for good reason.)
Here's a decent paper showing a model being highly accurate yet failing to form an accurate reconstruction of its environment [2]. Such a thing can happen when success at the task diverges from the need to model the world. World modeling is a natural thing for humans and animals to do because it generalizes exceptionally well, but you need to be careful in evaluating these things via benchmarks, and remember that extraordinary claims require extraordinary evidence. "Thinking" and "world modeling" are quite extraordinary claims, and we should not be hasty to attribute these characteristics when there are many reasonable and simpler alternative explanations.
[0] https://en.wikipedia.org/wiki/Discourse_on_the_Tides
[1] https://intellectualmathematics.com/opinionated-history-of-m...
[2] https://arxiv.org/abs/2406.03689
[disclosure] I have a PhD in Computer Vision and a BS in physics. I care very much about world modeling as a problem, but the response I get from many of my peers is "we just care if it works." Whether it works is a concern I share too; it's the reason I ask these questions. It feels quite odd that the motivation for my questions is also used to dismiss them. (FWIW, no physicist nor former physicist has ever responded to me this way.)
Essentially they found that when the knowledge is presented in a single, fixed way, the model learns to reproduce that exact sequence of tokens rather than "internalizing" the knowledge.
By varying the sentences, the model instead manages to separate out the knowledge, so to speak. This in turn drastically improves how well they can extract that knowledge later.
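As I understand it, the fix has roughly this shape (toy sketch; the template list and names are mine, and in practice you'd likely have another LLM paraphrase each fact rather than hand-writing frames):

    import random

    TEMPLATES = [
        "{obj} is the capital of {subj}.",
        "The capital of {subj} is {obj}.",
        "{subj}'s capital city is {obj}.",
        "Ask anyone where {subj} is governed from and they'll say {obj}.",
    ]

    def expand(subj: str, obj: str, k: int = 3) -> list[str]:
        """Return k distinct phrasings of the same (subject, capital) fact for fine-tuning."""
        return [t.format(subj=subj, obj=obj) for t in random.sample(TEMPLATES, k)]

    print(expand("France", "Paris"))
    # e.g. ['Paris is the capital of France.', "France's capital city is Paris.", ...]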
[1]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5250633
My assumption, based on the research, is that training on different prompts with the same answer gives you more robust Q&A behavior; training on variations of how to express the same concept generalizes. Training on the same prompt with different answers gives you creative diversity [2].
[1] https://arxiv.org/abs/2404.00213 [2] https://arxiv.org/abs/2503.17126
Meaning, during training, if the model expresses the same fact in some other form, maybe even with just one extra comma, that response will be marked just as wrong as a really wrong one.
In fact, the model may give an answer that’s better than the one in the training set - but it will still be punished for it and forced to change its weights because the answer doesn’t match token-for-token.
We don’t have a loss function for meaning. We only have one for token matching. Anyone who is serious about curating datasets for fine-tuning needs to take this into account.
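To make that concrete, a toy sketch (made-up vocabulary and probabilities) of how token-level negative log-likelihood treats an equally valid phrasing as simply wrong:

    import math

    # Hypothetical per-position next-token distributions. The reference answer is
    # "The tomato is red", but the model prefers the equally valid "A tomato is red".
    predicted = [
        {"The": 0.15, "A": 0.80, "My": 0.05},
        {"tomato": 0.97, "fruit": 0.03},
        {"is": 0.95, "looks": 0.05},
        {"red": 0.90, "green": 0.10},
    ]
    reference = ["The", "tomato", "is", "red"]

    loss = sum(-math.log(dist[tok]) for dist, tok in zip(predicted, reference))
    print(f"token-level NLL: {loss:.2f} nats")
    # Position 0 alone contributes -log(0.15) ~= 1.90 nats, exactly the penalty it
    # would get for a genuinely wrong word at probability 0.15. Nothing in the loss
    # knows the two phrasings mean the same thing.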
Again, this isn't to demonize symbolic AI or to say the answer isn't in the fusion of LLMs with knowledge graphs etc, but I think we now at least know that language is certainly within reach of software and that linguistic representations of knowledge are information-dense in ways we didn't previously anticipate.
Right now it seems teams manage a reasonably sophisticated LLM layer, while MCPs and instruction following remain dependent on one-shot context-window management.
I'm happy to see ML papers on Hacker News.