If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?
What the bigger, wider net buys you is generalization.
But yes, you've got it.
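For a concrete picture of what dropping the "failed tickets" would look like, here is a minimal sketch of one-shot magnitude pruning in NumPy (the weight matrix size and the 99% figure are made-up placeholders, not numbers from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(512, 512))        # hypothetical trained weight matrix

    sparsity = 0.99                        # keep only the largest 1% of weights
    k = int(W.size * (1 - sparsity))
    threshold = np.sort(np.abs(W), axis=None)[-k]
    mask = np.abs(W) >= threshold          # the surviving "winning ticket"

    W_pruned = W * mask
    print(mask.mean())                     # fraction of weights kept, ~0.01

The catch is that you only know which 1% to keep after you've trained the full, over-parameterized network; the lottery ticket papers then retrain that surviving subnetwork from its original initialization.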
It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)
Compute was just too expensive to have neural networks big enough for this not to be true.
Overfitting is the problem of having too little training data for your network size.
I'm sure that's what the textbook meant, rather than any point about the expense of computing power.
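A toy illustration of that definition (the numbers are arbitrary): give a model far more capacity than data points and the training error collapses while the test error blows up.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)

    x_train = rng.uniform(0, 1, 10)                 # too little data...
    y_train = f(x_train) + rng.normal(0, 0.1, 10)
    x_test = np.linspace(0, 1, 200)

    for degree in (3, 9):                           # ...for the bigger model
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
        print(degree, round(train_err, 4), round(test_err, 4))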
The leap is in transforming an ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even modeling the "distribution of valid/true things", since validity/truth is hard to define. It's something akin to "the distribution of things a human might say", implemented in the "dumbest" possible way of "modeling the distribution of humanity's collective textual output".
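The "dumbest possible way" can be made embarrassingly literal; a tiny sketch of modeling a text distribution by nothing more than counting which word follows which (the corpus here is obviously a stand-in):

    from collections import Counter, defaultdict
    import random

    # a stand-in corpus; the real thing is "humanity's collective textual output"
    corpus = "the cat sat on the mat the dog sat on the rug".split()

    # count which word follows which -- the crudest possible model of the distribution
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:] + corpus[:1]):
        counts[prev][nxt] += 1

    rng = random.Random(0)
    word, out = "the", ["the"]
    for _ in range(8):
        nxt_words = counts[word]
        word = rng.choices(list(nxt_words), weights=list(nxt_words.values()))[0]
        out.append(word)
    print(" ".join(out))

Scale the counting up enormously and swap the lookup table for a transformer, and you have, roughly, the proxy objective in question.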
So I'm always surprised when some linguists decry LLMs and cling to old linguistics paradigms instead of reclaiming the important role of language as (a) vehicle of intelligence.
And therefore I always thought that the more you master a language the better you are able to reason.
And considering how much we let LLMs formulate text for us, how dumb will we get?
Is that true though? Seems like you can easily have some cognitive process that visualizes things like cause and effect, simple algorithms or at least sequences of events.
That's not true. You can think "I want to walk around this building" without words, in abstract thoughts or in images.
Words are a layer above the thoughts, not the thoughts themselves. You can confirm this if you have ever had the experience of trying to say something but forgetting the right word. Your mind knew what it wanted to say, but it didn't know the word.
Chess players operate on sequences of moves a dozen turns ahead in their minds using no words, seeing the moves on the virtual chessboards they imagine.
Musicians hear the note they want to play in their minds.
Our brains have full multimedia support.
Without all three, progress would have been much slower.
"In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn."
Perhaps there is a kind of unstable situation whereby, once the winner starts to anneal toward predicting the training data, it does more and more of the predictive work. The more relevant that subset of weights proves to be to the result, the more of the learning it captures, because it is more responsive to subsequent updates.
Sure, this could’ve been a paragraph, but it wasn’t. I don’t think it’s particularly offensive for that.
Tell me, what is it you plan to do
with your five wild and precious minutes?
How is this different from overfitting, though? (PS: Overfitting isn't that bad if you think about it, as long as the test set, or whatever the model is asked at inference time, only involves problems covered by the supposedly large enough training dataset.)
In some ways this answer does fit Occam's Razor, in that the simple explanation is just scale, not complex algorithms.
I think the word "finding" is overloaded here. Are we "discovering," "deriving," "deducing," or simply "looking up" these patterns?
If "finding" can be implemented via a multi-page tour—ie deterministic choose-your-own-adventure—of a three-ring-binder (which is, essentially, how inference operates) then we're back at Searle's Chinese Room, and no intelligence is operative at runtime.
On the other hand, if the satisfaction of "finding" necessitates the creative synthesis of novel records pertaining to—if not outright modeling—external phenomena, ie "finding" a proof, then arguably it's not happening at training time, either.
How many novel proofs have LLMs found?
The ability to compress information: specifically, to run it through a simple rule that allows you to predict some future state.
The rules are simple but finding them is hard. The ability to find those rules, compress information, and thus predict the future efficiently is the very essence of intelligence.
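A trivial sketch of that framing: a hundred made-up observations collapse into a two-number rule, and the rule is exactly what lets you predict the next state.

    # made-up observations
    data = [3 + 7 * t for t in range(100)]

    # the compressed form: a two-number rule inferred from the data
    start, step = data[0], data[1] - data[0]
    assert all(x == start + step * t for t, x in enumerate(data))

    predict = lambda t: start + step * t   # the rule doubles as a predictor
    print(predict(100))                    # next, unseen state: 703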
The bias-variance tradeoff is a very old concept in statistics (but not sure how old, might very well be 300)
Anyway, note that the first algorithms related to neural networks are older than the digital computer by a decade at least.
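For reference, the decomposition itself (the standard textbook form, for y = f(x) + ε with noise variance σ²; not quoted from the article):

    \mathbb{E}[(y - \hat f(x))^2]
      = \underbrace{(\mathbb{E}[\hat f(x)] - f(x))^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}[(\hat f(x) - \mathbb{E}[\hat f(x)])^2]}_{\text{variance}}
      + \sigma^2

The tradeoff language comes from the classical picture where adding parameters reduces bias but inflates variance; the double-descent results are interesting precisely because they complicate that picture.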
This seems strangely worded. I assume that date is when some statistics paper was published, but there's no way to know with no definition or citations.
> 1. The 300-year timeframe refers to the foundational mathematical principles underlying modern bias-variance analysis, not the contemporary terminology. Bayes' theorem (1763) established the mathematical framework for updating beliefs with evidence, whilst Laplace's early work on statistical inference (1780s-1810s) formalised the principle that models must balance fit with simplicity to avoid spurious conclusions. These early statistical insights—that overly complex explanations tend to capture noise rather than signal—form the mathematical bedrock of what we now call the bias-variance tradeoff. The specific modern formulation emerged over several decades in the latter 20th century, but the core principle has governed statistical reasoning for centuries.
Also, I don't necessarily feel like the size of LLMs even comes close to overfitting the data. From a very unscientific standpoint it seems like the size of the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression techniques) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that proof that the weights are some sort of generalization of the input data rather than a memorization?
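Back-of-envelope version of that argument (all numbers are rough, made-up placeholders, not figures from the article):

    params = 70e9                 # hypothetical model size
    bytes_per_param = 2           # fp16/bf16
    tokens = 15e12                # hypothetical training set size
    bytes_per_token = 4           # very rough average of raw text per token

    weights_bytes = params * bytes_per_param          # ~140 GB
    data_bytes = tokens * bytes_per_token             # ~60 TB
    print(data_bytes / weights_bytes)                 # a few hundred to one

So even allowing for generous lossless compression of the raw text, the weights can't be a verbatim copy of the corpus; whether that makes them a "generalization" in any deeper sense is the part people argue about.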
2) The weights are definitely a generalization. The compression-based argument is sound.
3) There is definitely no overfitting. The article however used the word over-parameterization, which is a different thing. And LLMs are certainly over-parameterized. They have more parameters than strictly required to represent the dataset in a degrees-of-freedom statistical sense. This is not a bad thing though.
Just like having an over-parameterized database schema:
quiz(id, title, num_qns)
question(id, text, answer, quiz_id FK)
can be good for performance sometimes, the lottery ticket hypothesis (as ChatGPT explained in TFA) means that over-parameterization can also be good for neural networks sometimes. Note that this hypothesis is strictly tied to the fact that we use SGD (or Adam, or ...) as the optimisation algorithm. SGD is known to be biased towards generalized compressions [the lottery ticket hypothesis hypothesises why this is so]. That is to say, it's not an inherent property of the neural network architecture, or transformers, or such.
The issue with Vapnik's work is that it's pretty dense, and actually figuring out the Vapnik-Chervonenkis (VC) dimension etc. is pretty complicated. One can develop pretty good intuition, once you understand the stuff, without having to actually calculate, so most people don't take the time to do the calculation. And frankly, a lot of the time, you don't need to.
There may be something I'm missing completely, but to me the fact that models continue to generalize with a huge number of parameters is not all that surprising given how much we regularize when we fit NNs. A lot of the surprise comes from the fact that people in mathematical statistics and people who do neural networks (computer scientists) don't talk to each other as much as they should.
Strongly recommend the book Statistical Learning Theory by Vapnik for more on this.
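For context, the flavour of result Vapnik's book builds up to is a bound like the following (one standard form; with probability at least 1 − δ, for a hypothesis class of VC dimension d and n samples):

    R(h) \le R_{\mathrm{emp}}(h) + \sqrt{\frac{d\,(\ln(2n/d) + 1) + \ln(4/\delta)}{n}}

Which reinforces the point above: actually computing d for a real network is the hard part, and for modern nets the bound tends to be far too loose to be informative.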
I think the machine learning community was largely over overfitophobia by 2019 and people were routinely using overparametrized models capable of interpolating their training data while still generalizing well.
The Belkin et al. paper wasn't heresy. The authors were making a technical point - that certain theories of generalization are incompatible with this interpolation phenomenon.
The lottery ticket hypothesis paper's demonstration of the ubiquity of "winning tickets" - sparse parameter configurations that generalize - is striking, but these "winning tickets" aren't the solutions found by stochastic gradient descent (SGD) algorithms in practice. In the interpolating regime, the minima found by SGD are simple in a different sense perhaps more closely related to generalization. In the case of logistic regression, they are maximum margin classifiers; see https://arxiv.org/pdf/1710.10345.
The article points out some cool papers, but the narrative of plucky researchers bucking orthodoxy in 2019 doesn't track for me.
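For what it's worth, the implicit-bias result linked above (logistic regression under gradient descent drifting toward the max-margin direction) is easy to poke at numerically; a rough sketch with a made-up four-point dataset:

    import numpy as np

    # made-up linearly separable data, labels in {-1, +1}
    X = np.array([[2.0, 1.0], [1.5, 2.5], [-1.0, -2.0], [-2.5, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = np.zeros(2)
    for _ in range(200_000):
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))        # derivative factor of log(1 + e^-margin)
        w += 0.1 * (X * (y * p)[:, None]).mean(axis=0)

    print(w / np.linalg.norm(w))   # direction creeps toward the hard-margin separator

Run long enough, the printed direction approaches the hard-margin SVM solution even though nothing explicitly regularizes w.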
Back in the 2000s, the reason nobody was pursuing neural nets was simply compute power, and the fact that you couldn't iterate fast enough to make smaller neural networks work.
People were doing genetic algorithms and PSO for quite some time. Everyone knew that multi-dimensionality was the solution to overfitting: the more directions you can use to climb out of valleys, the better the system performed.
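For readers who never ran into it, a minimal particle swarm optimisation sketch (the toy objective and constants are arbitrary); the "many directions out of the valley" intuition is literally the velocity update:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                               # toy multi-modal objective (Rastrigin)
        return np.sum(x**2 - 10 * np.cos(2 * np.pi * x) + 10, axis=-1)

    dim, n_particles = 10, 30
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), f(pos)
    gbest = pbest[np.argmin(pbest_val)]

    for _ in range(500):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        val = f(pos)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        gbest = pbest[np.argmin(pbest_val)]

    print(f(gbest))                         # best value found by the swarm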
Time and time again, some kind of process will identify some simple but absurd adversarial "trick stimulus" that throws off the deep network solution. These seem like blatant cases of overfitting that go unrecognized or unchallenged in typical life because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.
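A rough sketch of how cheap such a trick stimulus can be, using the fast-gradient-sign idea against a plain linear scorer (the weights and input here are random placeholders):

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.normal(size=1000)                 # stand-in for a trained model
    x = rng.normal(size=1000)                 # a "natural" stimulus
    label = np.sign(w @ x)

    eps = 0.1                                 # tiny per-coordinate nudge
    x_adv = x - eps * label * np.sign(w)      # push every coordinate against the label
    print(label, np.sign(w @ x_adv))          # the prediction usually flips

Deep nets aren't linear, but the same one-step trick transfers to them surprisingly well, which is part of why the "unrecognized overfitting" framing is tempting.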
I guess I've not really thought of the bias-variance tradeoff as necessarily being about the number of parameters, but rather the flexibility of the model relative to the learnable information in the sample space. There are some formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of the ability to reproduce data that is wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.
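For anyone curious, the Shtarkov-Rissanen NML distribution mentioned here is (as far as I recall) defined as

    p_{\mathrm{NML}}(x) = \frac{p(x \mid \hat\theta(x))}{\sum_{x'} p(x' \mid \hat\theta(x'))}

where θ̂(x) is the maximum-likelihood fit to x itself. The complexity penalty is the log of that denominator (the Shtarkov sum), which depends on how well the model class can fit anything at all rather than directly on its parameter count, which is exactly the distinction being drawn here.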