"The Bitter Lesson" is wrong. Well sort of

https://assaf-pinhasi.medium.com/the-bitter-lesson-is-wrong-sort-of-a3d021864924
27•GavCo•4h ago

Comments

rhaps0dy•2h ago
Sutton was talking about progress in AI overall, whereas Pinhasi (OP) is talking about building one model for production right now. Of course adding some hand-coded knowledge is essential for the latter, but it has not provided much long-term progress. (Even CNNs and group-convolutional NNs, which seek to encode invariants to increase efficiency while still doing almost only learning, seem to be on the way out)
aabhay•2h ago
The main problem with the “Bitter Lesson” is that there’s something even bitter-er behind it: the “Harsh Reality” that while we may scale models on compute and data, simply pouring tons of data in without any sort of curation yields essentially garbage models.

The “Harsh Reality” is that while you may only need data, the current best models and the companies behind them spend enormously on gathering high-quality labeled data with extensive oversight and curation. This curation is of course being partially automated as well, but ultimately there are billions or even tens of billions of dollars flowing into gathering, reviewing, and processing subjectively high-quality data.

Interestingly, at the time that essay was published, the harsh reality was not so harsh. For example, in things like face detection, (actual) next-word prediction, and other purely self-supervised models (not instruction-tuned or “Chat”-style ones), data was truly all you needed. You didn’t need “good” faces. As long as it was indeed a face, the data itself was enough. Now, it’s not. In order to make these machines useful and not just function approximators, we need extremely large dataset curation industries.

If you learned the bitter lesson, you better accept the harsh reality, too.

bobbiechen•2h ago
So true. I recently wrote about how Merlin achieved magical bird identification not through better algorithms, but through better expertise in creating great datasets: https://digitalseams.com/blog/what-birdsong-and-backends-can...

I think "harsh reality" is one way to look at it, but you can also take an optimistic perspective: you really can achieve great, magical experiences by putting in (what could be considered) unreasonable effort.

mhuffman•1h ago
Thanks for the intro to Merlin! I just went outside my house and used it on 5 different types of birds, and it identified 100% of them. Relevant (possibly out of date) xkcd comic:

[0] https://xkcd.com/1425/

Xymist•7m ago
Relevant - and old enough that those five years have been successfully granted!
pphysch•2h ago
Another name for gathering and curating high-quality datasets is "science". One would hope "AI pioneer" USA would embrace this harsh reality and invest massively in basic science education and infrastructure. But we are seeing the opposite, and basically no awareness of this "harsh reality" amid the AI hype...
vineyardmike•1h ago
While I agree with you, it’s worth noting that current LLM training uses a significant percentage of all available written data. The transition from GPT-2-era models to now (GPT-3+) took us from novel models that could kinda imitate speech to models that can converse, write code, and use tools. It was only after the readily available data was exhausted that further gains came from curation and large amounts of synthetic data.
aabhay•1h ago
Transfer learning isn’t about “exhausting” all available un-curated data; it’s simply that the systems are large enough to support it. There’s not that much of a reason to train on all available data. And it’s not all of it anyway: there’s still very significant filtration happening. For example, they don’t train on petabytes of log files; that would just be terribly uninteresting data.
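
To make the filtration point concrete, here is a toy sketch of heuristic pre-filtering; the regexes and sample lines are invented for illustration, nothing like any lab's actual pipeline:

    import re

    def looks_like_log_line(line: str) -> bool:
        """Heuristic: timestamped or level-tagged lines are likely machine logs."""
        return bool(re.match(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}", line)) \
            or bool(re.search(r"\b(DEBUG|INFO|WARN|ERROR)\b", line))

    def filter_corpus(lines):
        """Keep only lines that don't look like log output."""
        return [l for l in lines if not looks_like_log_line(l)]

    docs = [
        "2024-01-02 13:37:00 INFO request served in 12ms",
        "The bitter lesson argues that general methods win.",
    ]
    print(filter_corpus(docs))  # only the prose line survives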
Calavar•50m ago
> The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools.

Which is fundamentally about data. OpenAI invested an absurd amount of money to get the human annotations that drive RLHF.

RLHF itself is a very vanilla reinforcement learning algo + some branding/marketing.
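
For reference, the usual RLHF objective (in the standard InstructGPT-style formulation) really is plain policy optimization: maximize a learned reward while a KL penalty keeps the policy near the pretrained reference model:

    \max_{\pi}\; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot|x)}\left[ r_\phi(x, y) \right] \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi(\cdot|x) \,\|\, \pi_{\mathrm{ref}}(\cdot|x) \right)

Everything hard lives in r_\phi, the reward model fit to those expensive human annotations.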

v9v•23m ago
I think your comment has some threads in common with Rodney Brooks' response: https://rodneybrooks.com/a-better-lesson/
macawfish•2h ago
In my opinion the useful part of "the bitter lesson" has nothing to do with throwing more compute and more data at stuff. It has to do with actually using ML instead of trying to manually and cleverly tweak things, and with effectively leveraging the data you have as part of that (again, using more ML) rather than trying to manually label everything.
rdw•2h ago
The bitter lesson is becoming misunderstood as the world moves on. Unstated yet core to it is that AI researchers were historically attempting to build an understanding of human intelligence. They intended to assemble a human brain piece by piece and thus be able to explain (and fix) our own biological ones, much as can be done with physical simulations of knee joints. Of course, you can also use that knowledge to create useful thinking machines, because you understand it well enough to control it, much like how we have many robotic joints.

So, the bitter lesson is based on a disappointment that you're building intelligence without understanding why it works.

DoctorOetker•37m ago
Right, like discovering Huygens' principle, or interference, or sum-over-all-paths integrals in physics.

Just because a whole lot of physical phenomena can be explained by a couple of foundational principles does not mean that understanding those core patterns automatically endows one with an understanding of how and why materials refract light, and of a plethora of other specific effects... effects worth understanding individually, even if still explained in terms of those foundational concepts.

Knowing a complicated set of axioms or postulates enables one to derive theorems from them, but those implied theorem proofs are nonetheless non-trivial and have a value of their own (even though they can be expressed and expanded into a DAG of applications of that "bitterly minimal" axiomatization).

Once enough patterns are correctly modeled by machines, and given enough time to analyze them, people will eventually discover better explanations of how and why things work (beyond the merely abstract knowledge that latent parameters were fitted against a loss function).

In some sense deeper understanding has already come for the simpler models like word2vec, where many papers have analyzed and explained relations between word vectors. This too lagged behind the creation and utilization of word vector embeddings.

It is not inconceivable that someday someone observes an analogy between, say, QKV tensors and the triples resulting from graph linearization: think subject, object, predicate. (Even though I hate those triples: try modeling a ternary relation like 2+5=7 with SOP triples; they're really only meant to capture "sky - is - blue" associations. A better type of triple would be player-role-act triples; one can then model ternary relations, but one needs to reify the relation.)
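
A sketch of that reification move, with made-up node and role names purely for illustration:

    # Subject-predicate-object triples capture binary facts:
    sky_fact = ("sky", "is", "blue")

    # A ternary relation like 2 + 5 = 7 doesn't fit a single such triple.
    # Reify the addition as a node ("add1"), then attach each player by role:
    reified = [
        ("add1", "role:addend", "2"),
        ("add1", "role:addend", "5"),
        ("add1", "role:sum",    "7"),
    ]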

Similarly, without mathematical training, humans display awareness of the concepts of sets, membership, existence, ... without a formal system. The chatbots display this awareness too. It's all vague naive set theory. But how are DNNs modeling set theory? That's a paper for someday.

godelski•1h ago
I'm not sure whether the Bitter Lesson is wrong; I think we'd need clarification from Sutton (does someone have this?).

But I do know "Scale is All You Need" is wrong. And VERY wrong.

Scaling has done a lot. Without a doubt it is very useful. But this is a drastic oversimplification of all the work that has happened over the last 10-20 years. ConvNeXt and "ResNet Strikes Back" didn't take off, for reasons, despite being very impressive. There have been a lot of algorithmic changes, a lot of changes to training procedures, a lot of changes to how we collect data[0], and more.

We have to be very honest: you can't just buy your way to AGI. There's still innovation that needs to be done. This is great for anyone still looking to get into the space; the game isn't close to being over. I'd argue that this is great for investors too, as there are a lot of techniques looking to prove themselves at scale. Your unicorns are going to be over here. A dark horse isn't a horse that just looks like every other horse. That might be a "safer" bet, but it's like betting on amateur jockeys and horses that just train similarly to professional ones: they have to do a lot of catching up, even if the results are fairly certain. At that point you're not investing in the tech, you're investing in the person or the market strategy.

[0] Okay, I'll buy this one as scale if we really want to argue that these changes are about scaling data effectively but we also look at smaller datasets differently because of these lessons.

roadside_picnic•1h ago
"The Bitter Lesson" certainly seems correct when applied to whatever the limit of the current state of the art is, but in practice solving day-to-day ML problems, outside of FAANG-style companies and cutting edge research, data is always much more constrained.

I have, multiple times in my career, solved a problem using simple, intelligible models that have empirically outperformed neural models ultimately because there was not enough data for the neural approach to learn anything. As a community we tend to obsess over architecture and then infrastructure, but data is often the real limiting factor.

When I was early in my career I would always try to apply very general, data-hungry models to all my problems.. with very mixed success. As I became more skilled I became a staunch advocate of only using simple models you could understand, with much more successful results (which is what led to this revised opinion). But at this point in my career, I increasingly see that one's approach to modeling should be more information-theoretic: try to find the model whose channel capacity best matches your information rate.
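
In practice that matching often reduces to honestly comparing models of different capacity on the data you actually have. A minimal sketch (synthetic placeholder data, arbitrary model choices):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))  # deliberately small-data regime
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=60)

    for model in (Ridge(alpha=1.0),
                  MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)):
        score = cross_val_score(model, X, y, cv=5).mean()  # mean R^2 across folds
        print(type(model).__name__, round(float(score), 3))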

As a Bayesian, I also think there's a very reasonable explanation for why "The Bitter Lesson" rings true over and over again. In E.T. Jaynes' writing he often talks about Bayes' theorem in terms of P(D|H) (i.e. the probability of the Data given the Hypothesis, or vice versa), but, especially in the earlier chapters, he purposefully adds an X to that equation: P(D|H,X), where X is a stand-in for all of our prior information about the world. Typically we think of prior information as literal data, but Jaynes points out that our entire world of understanding is also part of our prior context.
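
Spelled out, Jaynes' notation is just Bayes' theorem with the background information X carried through every term:

    P(H \mid D, X) = \frac{P(D \mid H, X)\, P(H \mid X)}{P(D \mid X)}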

In this view, models that "leverage human understanding" (i.e. are fully intelligible) are essentially throwing out information at the limit. But to my earlier point, if the data falls quite short of that limit, then those intelligible models are adding information in data-constrained scenarios. I think the challenge in practical application is figuring out the threshold beyond which you need to adopt a more general approach.

Currently I'm very much in love with Gaussian Processes, which, for constrained data environments, offer a powerful combination of both of these methods. You can give the model prior hints about what things should look like via the relative structure of the kernel and its priors (e.g. there should be some roughly annual seasonal component, and one roughly weekly seasonal component) but otherwise let the data decide.
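
A minimal sketch of that kind of structured kernel using scikit-learn; the periods, bounds, and data are illustrative only:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

    # Structural prior: a roughly annual cycle, a roughly weekly cycle,
    # a slow trend, and observation noise. The periods are hints; the
    # bounds let the data refine them.
    kernel = (
        ExpSineSquared(periodicity=365.0, periodicity_bounds=(330.0, 400.0))
        + ExpSineSquared(periodicity=7.0, periodicity_bounds=(6.0, 8.0))
        + RBF(length_scale=90.0)
        + WhiteKernel(noise_level=1.0)
    )

    X = np.arange(200.0).reshape(-1, 1)  # day index
    rng = np.random.default_rng(0)
    y = np.sin(2 * np.pi * X.ravel() / 7) + 0.1 * rng.normal(size=200)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    print(gp.kernel_)  # fitted hyperparameters: the data got to decide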

littlestymaar•9m ago
The Leela Chess Zero vs Stockfish case also offers an interesting perspective on the bitter lesson.

Here's my (maybe a bit loose) recollection of what happened:

Step 1. Stockfish was the typical human-knowledge AI, with tons of actual chess knowledge injected in the process of building an efficient chess engine.

Step 2. Then came Leela Chess Zero, with its AlphaZero-inspired training: a chess engine trained purely with RL, with no prior chess knowledge added. And it beat Stockfish. This was a “bitter lesson” moment.

Step 3. The Stockfish devs added a neural network (NNUE) for evaluation, in addition to their existing heuristics. And Stockfish easily took back its crown.

Yes, throwing more compute at a problem is an effective way to solve it, but if all you have is compute, you'll pretty certainly lose to somebody who has both compute and knowledge.
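
A toy sketch of the step-3 hybrid (hypothetical helpers, nothing like Stockfish's real code): handcrafted knowledge decides when the cheap heuristic verdict is trustworthy and when to pay for the learned evaluation.

    LARGE_IMBALANCE = 300  # centipawns; illustrative threshold

    def classical_eval(position: dict) -> int:
        # Stand-in for a handcrafted evaluation (material count only here).
        return position["material"]

    def nn_eval(position: dict) -> int:
        # Stand-in for a learned evaluation network.
        return position["material"] + position.get("nn_adjustment", 0)

    def hybrid_eval(position: dict) -> int:
        """Toy hybrid: heuristics decide when to trust the network."""
        fast = classical_eval(position)
        if abs(fast) > LARGE_IMBALANCE:  # clearly won/lost: cheap eval suffices
            return fast
        return nn_eval(position)  # subtle position: use the learned eval

    print(hybrid_eval({"material": 20, "nn_adjustment": 35}))  # -> 55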

Triple Scripts

https://triplescripts.org
1•akkartik•2m ago•0 comments

FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

https://www.tomshardware.com/software/the-biggest-speedup-ive-seen-so-far-ffmpeg-devs-boast-of-another-100x-leap-thanks-to-handwritten-assembly-code
4•harambae•8m ago•0 comments

Show HN: Browse the Web with Superpowers

https://usesuperpowers.app/
2•harshdoesdev•11m ago•0 comments

Machine Bullshit: Characterizing the Emergent Disregard for Truth in LLMs

https://arxiv.org/abs/2507.07484
2•delichon•11m ago•1 comments

Dear Sam Altman

2•upwardbound2•15m ago•0 comments

Following news on social media boosts knowledge, belief accuracy and trust

https://www.nature.com/articles/s41562-025-02205-6
1•PaulHoule•16m ago•0 comments

Cybermania.ws Blocked by Cloudflare

https://cybermania.ws
1•bojanga•16m ago•1 comments

Tough news for our UK users

https://blog.janitorai.com/posts/3/
1•airhangerf15•16m ago•0 comments

ToolShell Mass Exploitation (CVE-2025-53770)

https://research.eye.security/sharepoint-under-siege/
1•thejj100100•17m ago•0 comments

Food contact articles as source of micro- and nanoplastics

https://www.nature.com/articles/s41538-025-00470-3
1•atombender•17m ago•0 comments

Golang's Weird Little Iterators

https://mcyoung.xyz/2024/12/16/rangefuncs/#fnref:tooling
1•fanf2•18m ago•0 comments

Show HN: Duende: Web UX for guiding Gemini as it improves your source code

https://github.com/alefore/duende
1•afc•18m ago•0 comments

Show HN: I built Realer Estate to find apartment renters the best deals in NYC

https://realerestate.org
1•realerestate•18m ago•0 comments

Political Fundraising Email Database

https://thescoop.org/political-fundraising-emails/
1•m-hodges•21m ago•0 comments

EU commissioner shocked by dangers of some goods sold by Shein and Temu

https://www.theguardian.com/business/2025/jul/20/eu-commissioner-shocked-dangerous-goods-sold-shein-temu
2•Michelangelo11•24m ago•0 comments

Nvidia Brings Reasoning Models to Consumers Ranging from 1.5B to 32B Parameters

https://www.techpowerup.com/339089/nvidia-brings-reasoning-models-to-consumers-ranging-from-1-5b-to-32b-parameters
1•hank808•26m ago•1 comments

Prexist – Instantly check if your startup idea already exists!

https://prexist.pages.dev/
2•e33or-assasin•28m ago•1 comments

We Will Not Accidentally Create AGI

https://loukidelis.com/2025/07/06/against-easy-agi.html
3•fromwilliam•30m ago•1 comments

Call Me a Jerk: Persuading AI to Comply with Objectionable Requests

https://gail.wharton.upenn.edu/research-and-insights/call-me-a-jerk-persuading-ai/
2•CharlesW•32m ago•0 comments

Stack frame layout on x86-64

https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64
1•90s_dev•32m ago•0 comments

Mute-by-default is why your video calls suck

https://caseyavila.com/blog/unmute/
1•caseyavila•35m ago•0 comments

HyperTime: A Continuous, Location-Precise Alternative to Time Zones

https://hyper-time.replit.app/
2•sosore•36m ago•1 comments

He Had Dangerous Delusions. ChatGPT Admitted It Made Them Worse

https://www.wsj.com/tech/ai/chatgpt-chatbot-psychology-manic-episodes-57452d14
9•johntfella•36m ago•2 comments

Show HN: I built an AI tool that generates product photos and videos

https://getaicraft.com
1•SaaSified•37m ago•0 comments

LLM-in-a-Box: A Templated, Self-Hostable Framework for Generative AI

https://github.com/complexity-science-hub/llm-in-a-box-template
2•geoHeil•37m ago•2 comments

D-Day veteran "Papa Jake" Larson who became TikTok star dies aged 102

https://news.sky.com/story/d-day-veteran-papa-jake-larson-who-became-tiktok-star-dies-aged-102-13399339
3•austinallegro•39m ago•0 comments

Staying cool without refrigerants: Next-generation Peltier cooling

https://news.samsung.com/global/interview-staying-cool-without-refrigerants-how-samsung-is-pioneering-next-generation-peltier-cooling
15•simonebrunozzi•41m ago•14 comments

Avoiding Management

http://funcall.blogspot.com/2025/06/avoiding-management.html
1•kaeruct•42m ago•0 comments

Welcoming the Next Generation of Programmers

https://lucumr.pocoo.org/2025/7/20/the-next-generation/
1•yomismoaqui•44m ago•0 comments

The Guardian view on social networks: the friendships that can change your life

https://www.theguardian.com/commentisfree/2025/mar/24/the-guardian-view-on-social-networks-the-friendships-that-can-change-your-life
1•wslh•46m ago•0 comments