If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?
What the bigger, wider net buys you is generalization.
But yes, you've got it.
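For a concrete picture of what dropping the "failed tickets" would look like, here is a minimal sketch of one-shot magnitude pruning in NumPy (the weight matrix size and the 99% figure are made-up placeholders, not numbers from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(512, 512))        # hypothetical trained weight matrix

    sparsity = 0.99                        # keep only the largest 1% of weights
    k = int(W.size * (1 - sparsity))
    threshold = np.sort(np.abs(W), axis=None)[-k]
    mask = np.abs(W) >= threshold          # the surviving "winning ticket"

    W_pruned = W * mask
    print(mask.mean())                     # fraction of weights kept, ~0.01

The catch is that you only know which 1% to keep after you've trained the full, over-parameterized network; the lottery ticket papers then retrain that surviving subnetwork from its original initialization.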
It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)
Compute was just too expensive to have neural networks big enough for this not to be true.
Overfitting is the problem of having too little training data for your network size.
I'm sure that's what the textbook meant, rather than any point about the expense of computing power.
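A toy illustration of that definition (the numbers are arbitrary): give a model far more capacity than data points and the training error collapses while the test error blows up.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)

    x_train = rng.uniform(0, 1, 10)                 # too little data...
    y_train = f(x_train) + rng.normal(0, 0.1, 10)
    x_test = np.linspace(0, 1, 200)

    for degree in (3, 9):                           # ...for the bigger model
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
        print(degree, round(train_err, 4), round(test_err, 4))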
The leap is in transforming an ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even modeling the "distribution of valid/true things", since validity/truth is hard to define. It's something akin to "the distribution of things a human might say", implemented in the "dumbest" possible way of "modeling the distribution of humanity's collective textual output".
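The "dumbest possible way" can be made embarrassingly literal; a tiny sketch of modeling a text distribution by nothing more than counting which word follows which (the corpus here is obviously a stand-in):

    from collections import Counter, defaultdict
    import random

    # a stand-in corpus; the real thing is "humanity's collective textual output"
    corpus = "the cat sat on the mat the dog sat on the rug".split()

    # count which word follows which -- the crudest possible model of the distribution
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:] + corpus[:1]):
        counts[prev][nxt] += 1

    rng = random.Random(0)
    word, out = "the", ["the"]
    for _ in range(8):
        nxt_words = counts[word]
        word = rng.choices(list(nxt_words), weights=list(nxt_words.values()))[0]
        out.append(word)
    print(" ".join(out))

Scale the counting up enormously and swap the lookup table for a transformer, and you have, roughly, the proxy objective in question.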
So I'm always surprised when some linguists decry LLMs and cling to old linguistics paradigms instead of reclaiming the important role of language as (a) vehicle of intelligence.
And therefore I always thought that the more you master a language the better you are able to reason.
And considering how much we let LLMs formulate text for us, how dumb will we get?
Is that true though? Seems like you can easily have some cognitive process that visualizes things like cause and effect, simple algorithms or at least sequences of events.
That's not true. You can think "I want to walk around this building" without words, in abstract thoughts or in images.
Words are a layer above the thoughts, not the thoughts themselves. You can confirm this if you have ever had the experience of trying to say something but forgetting the right word. Your mind knew what it wanted to say, but it didn't know the word.
Chess players operate on sequences of moves a dozen turns ahead in their minds using no words, seeing the moves on the virtual chessboards they imagine.
Musicians hear the note they want to play in their minds.
Our brains have full multimedia support.
Without all three, progress would have been much slower.
"In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn."
Perhaps there is a kind of unstable situation whereby, once the winner starts to anneal toward predicting the training data, it does more and more of the predictive work. The more relevant that subset of weights proves to be to the result, the more of the learning it captures, because it is more responsive to subsequent updates.
Sure, this could’ve been a paragraph, but it wasn’t. I don’t think it’s particularly offensive for that.
Tell me, what is it you plan to do
with your five wild and precious minutes?
How is this different from overfitting, though? (PS: Overfitting isn't that bad if you think about it, as long as the test set, or whatever the model is asked at inference time, only involves problems covered by the supposedly large enough training dataset.)
In some ways this answer does fit Occam's Razor, in that the simple explanation is just scale, not complex algorithms.
I think the word "finding" is overloaded here. Are we "discovering," "deriving," "deducing," or simply "looking up" these patterns?
If "finding" can be implemented via a multi-page tour—ie deterministic choose-your-own-adventure—of a three-ring-binder (which is, essentially, how inference operates) then we're back at Searle's Chinese Room, and no intelligence is operative at runtime.
On the other hand, if the satisfaction of "finding" necessitates the creative synthesis of novel records pertaining to—if not outright modeling—external phenomena, ie "finding" a proof, then arguably it's not happening at training time, either.
How many novel proofs have LLMs found?
The ability to compress information: specifically, to run it through a simple rule that allows you to predict some future state.
The rules are simple but finding them is hard. The ability to find those rules, compress information, and thus predict the future efficiently is the very essence of intelligence.
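A trivial sketch of that framing: a hundred made-up observations collapse into a two-number rule, and the rule is exactly what lets you predict the next state.

    # made-up observations
    data = [3 + 7 * t for t in range(100)]

    # the compressed form: a two-number rule inferred from the data
    start, step = data[0], data[1] - data[0]
    assert all(x == start + step * t for t, x in enumerate(data))

    predict = lambda t: start + step * t   # the rule doubles as a predictor
    print(predict(100))                    # next, unseen state: 703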
The bias-variance tradeoff is a very old concept in statistics (but not sure how old, might very well be 300)
Anyway, note that the first algorithms related to neural networks are older than the digital computer by a decade at least.
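For reference, the decomposition itself (the standard textbook form, for y = f(x) + ε with noise variance σ²; not quoted from the article):

    \mathbb{E}[(y - \hat f(x))^2]
      = \underbrace{(\mathbb{E}[\hat f(x)] - f(x))^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}[(\hat f(x) - \mathbb{E}[\hat f(x)])^2]}_{\text{variance}}
      + \sigma^2

The tradeoff language comes from the classical picture where adding parameters reduces bias but inflates variance; the double-descent results are interesting precisely because they complicate that picture.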
This seems strangely worded. I assume that date is when some statistics paper was published, but there's no way to know with no definition or citations.
> 1. The 300-year timeframe refers to the foundational mathematical principles underlying modern bias-variance analysis, not the contemporary terminology. Bayes' theorem (1763) established the mathematical framework for updating beliefs with evidence, whilst Laplace's early work on statistical inference (1780s-1810s) formalised the principle that models must balance fit with simplicity to avoid spurious conclusions. These early statistical insights—that overly complex explanations tend to capture noise rather than signal—form the mathematical bedrock of what we now call the bias-variance tradeoff. The specific modern formulation emerged over several decades in the latter 20th century, but the core principle has governed statistical reasoning for centuries.
Also, I don't necessarily feel like the size of LLMs even comes close to overfitting the data. From a very unscientific standpoint it seems like the size of the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression techniques) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that proof that the weights are some sort of generalization of the input data rather than a memorization?
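Back-of-envelope version of that argument (all numbers are rough, made-up placeholders, not figures from the article):

    params = 70e9                 # hypothetical model size
    bytes_per_param = 2           # fp16/bf16
    tokens = 15e12                # hypothetical training set size
    bytes_per_token = 4           # very rough average of raw text per token

    weights_bytes = params * bytes_per_param          # ~140 GB
    data_bytes = tokens * bytes_per_token             # ~60 TB
    print(data_bytes / weights_bytes)                 # a few hundred to one

So even allowing for generous lossless compression of the raw text, the weights can't be a verbatim copy of the corpus; whether that makes them a "generalization" in any deeper sense is the part people argue about.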
2) The weights are definitely a generalization. The compression-based argument is sound.
3) There is definitely no overfitting. The article however used the word over-parameterization, which is a different thing. And LLMs are certainly over-parameterized. They have more parameters than strictly required to represent the dataset in a degrees-of-freedom statistical sense. This is not a bad thing though.
Just like having an over-parameterized database schema:
quiz(id, title, num_qns)
question(id, text, answer, quiz_id FK)
can be good for performance sometimes, the lottery ticket hypothesis (as ChatGPT explained in TFA) means that over-parameterization can also be good for neural networks sometimes. Note that this hypothesis is strictly tied to the fact that we use SGD (or Adam, or ...) as the optimisation algorithm. SGD is known to be biased towards generalized compressions [the lottery ticket hypothesis hypothesises why this is so]. That is to say, it's not an inherent property of the neural network architecture, or transformers, or such.
The issue with Vapnik's work is that it's pretty dense, and actually figuring out the Vapnik-Chervonenkis (VC) dimension etc. is pretty complicated. One can develop pretty good intuition, once you understand the stuff, without having to actually calculate, so most people don't take the time to do the calculation. And frankly, a lot of the time, you don't need to.
There may be something I'm missing completely, but to me the fact that models continue to generalize with a huge number of parameters is not all that surprising given how much we regularize when we fit NNs. A lot of the surprise comes from the fact that people in mathematical statistics and people who do neural networks (computer scientists) don't talk to each other as much as they should.
Strongly recommend the book Statistical Learning Theory by Vapnik for more on this.
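For context, the flavour of result Vapnik's book builds up to is a bound like the following (one standard form; with probability at least 1 − δ, for a hypothesis class of VC dimension d and n samples):

    R(h) \le R_{\mathrm{emp}}(h) + \sqrt{\frac{d\,(\ln(2n/d) + 1) + \ln(4/\delta)}{n}}

Which reinforces the point above: actually computing d for a real network is the hard part, and for modern nets the bound tends to be far too loose to be informative.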
I think the machine learning community was largely over overfitophobia by 2019 and people were routinely using overparametrized models capable of interpolating their training data while still generalizing well.
The Belkin et al. paper wasn't heresy. The authors were making a technical point - that certain theories of generalization are incompatible with this interpolation phenomenon.
The lottery ticket hypothesis paper's demonstration of the ubiquity of "winning tickets" - sparse parameter configurations that generalize - is striking, but these "winning tickets" aren't the solutions found by stochastic gradient descent (SGD) algorithms in practice. In the interpolating regime, the minima found by SGD are simple in a different sense perhaps more closely related to generalization. In the case of logistic regression, they are maximum margin classifiers; see https://arxiv.org/pdf/1710.10345.
The article points out some cool papers, but the narrative of plucky researchers bucking orthodoxy in 2019 doesn't track for me.
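For what it's worth, the implicit-bias result linked above (logistic regression under gradient descent drifting toward the max-margin direction) is easy to poke at numerically; a rough sketch with a made-up four-point dataset:

    import numpy as np

    # made-up linearly separable data, labels in {-1, +1}
    X = np.array([[2.0, 1.0], [1.5, 2.5], [-1.0, -2.0], [-2.5, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = np.zeros(2)
    for _ in range(200_000):
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))        # derivative factor of log(1 + e^-margin)
        w += 0.1 * (X * (y * p)[:, None]).mean(axis=0)

    print(w / np.linalg.norm(w))   # direction creeps toward the hard-margin separator

Run long enough, the printed direction approaches the hard-margin SVM solution even though nothing explicitly regularizes w.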
Back in the 2000s, the reason nobody was pursuing neural nets was simply compute power, and the fact that you couldn't iterate fast enough to make smaller neural networks work.
People were doing genetic algorithms and PSO for quite some time. Everyone knew that multi-dimensionality was the solution to overfitting: the more directions you can use to climb out of valleys, the better the system performed.
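For readers who never ran into it, a minimal particle swarm optimisation sketch (the toy objective and constants are arbitrary); the "many directions out of the valley" intuition is literally the velocity update:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                               # toy multi-modal objective (Rastrigin)
        return np.sum(x**2 - 10 * np.cos(2 * np.pi * x) + 10, axis=-1)

    dim, n_particles = 10, 30
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), f(pos)
    gbest = pbest[np.argmin(pbest_val)]

    for _ in range(500):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        val = f(pos)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        gbest = pbest[np.argmin(pbest_val)]

    print(f(gbest))                         # best value found by the swarm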
Time and time again, some kind of process will identify some simple but absurd adversarial "trick stimulus" that throws off the deep network solution. These seem like blatant cases of overfitting that go unrecognized or unchallenged in typical life because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.
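A rough sketch of how cheap such a trick stimulus can be, using the fast-gradient-sign idea against a plain linear scorer (the weights and input here are random placeholders):

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.normal(size=1000)                 # stand-in for a trained model
    x = rng.normal(size=1000)                 # a "natural" stimulus
    label = np.sign(w @ x)

    eps = 0.1                                 # tiny per-coordinate nudge
    x_adv = x - eps * label * np.sign(w)      # push every coordinate against the label
    print(label, np.sign(w @ x_adv))          # the prediction usually flips

Deep nets aren't linear, but the same one-step trick transfers to them surprisingly well, which is part of why the "unrecognized overfitting" framing is tempting.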
I guess I've not really thought of the bias-variance tradeoff as necessarily being about the number of parameters, but rather the flexibility of the model relative to the learnable information in the sample space. There are some formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of the ability to reproduce data that is wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.
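For anyone curious, the Shtarkov-Rissanen NML distribution mentioned here is (as far as I recall) defined as

    p_{\mathrm{NML}}(x) = \frac{p(x \mid \hat\theta(x))}{\sum_{x'} p(x' \mid \hat\theta(x'))}

where θ̂(x) is the maximum-likelihood fit to x itself. The complexity penalty is the log of that denominator (the Shtarkov sum), which depends on how well the model class can fit anything at all rather than directly on its parameter count, which is exactly the distinction being drawn here.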