The “Harsh Reality” is that while you may only need data, the companies behind the current best models spend enormously on gathering high-quality labeled data with extensive oversight and curation. That curation is of course being partially automated as well, but ultimately there are billions, or even tens of billions, of dollars flowing into gathering, reviewing, and processing subjectively high-quality data.
Interestingly, at the time that paper was published, the harsh reality was not so harsh. For example, in things like face detection, (actual) next-word prediction, and other purely self-supervised models (not instruction-tuned or “chat”-style models), data truly was all you needed. You didn’t need “good” faces. As long as it was indeed a face, the data itself was enough. Now it’s not. In order to make these machines useful, and not just function approximators, we need extremely large dataset-curation industries.
If you learned the bitter lesson, you better accept the harsh reality, too.
I think "harsh reality" is one way to look at it, but you can also take an optimistic perspective: you really can achieve great, magical experiences by putting in (what could be considered) unreasonable effort.
Which is fundamentally about data. OpenAI invested an absurd amount of money to get the human annotations to drive RLHF.
RLHF itself is a very vanilla reinforcement learning algo + some branding/marketing.
So, the bitter lesson is based on a disappointment that you're building intelligence without understanding why it works.
Just because a whole lot of physical phenomena can be explained by a couple of foundational principles, it doesn't follow that understanding those core principles automatically endows one with an understanding of how and why materials refract light, along with a plethora of other specific effects... effects worth understanding individually, even if they are still explained in terms of those foundational concepts.
Knowing a complicated set of axioms or postulates enables one to derive theorems from them, but those implied theorem proofs are nonetheless non-trivial and have a value of their own (even though they can be expressed and expanded into a DAG of applications of that "bitterly minimal" axiomatization).
Once enough patterns are correctly modeled by machines, and given enough time to analyze them, people will eventually discover a better account of how and why things work (beyond the mere abstract knowledge that latent parameters were fitted against a loss function).
In some sense deeper understanding has already come for the simpler models like word2vec, where many papers have analyzed and explained relations between word vectors. This too lagged behind the creation and utilization of word vector embeddings.
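For a concrete sense of the kind of relation those papers study, here is a minimal sketch (my own, not from the thread) of the classic analogy arithmetic on pretrained embeddings, assuming gensim and the small GloVe vectors available through its downloader:

    # Sketch only: the model name and library choice are assumptions for illustration.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word embeddings

    # The much-analyzed relation: vec("king") - vec("man") + vec("woman") ~ vec("queen")
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))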
It is not inconceivable that someday someone observes an analogy between, say, QKV tensors and the triples resulting from graph linearization: think subject, object, predicate. (Even though I hate those triples: try modeling a ternary relation like 2+5=7 with SOP triples; they're really only meant to capture “sky - is - blue” associations. A better type of triple would be player-role-act triples, with which one can model ternary relations, but one needs to reify the relation.)
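To make the reification point concrete, here is a tiny sketch (names and structure are mine, purely illustrative) of how a reified relation node with role edges captures 2+5=7, where flat two-entity triples cannot:

    # Plain SOP triples bind only two entities ("sky - is - blue").
    # Reify the addition itself as a node, then attach each participant via a role edge.
    triples = [
        ("add_1", "instance_of", "addition"),   # the reified relation
        ("add_1", "role:addend", "2"),          # player-role-act style edges
        ("add_1", "role:addend", "5"),
        ("add_1", "role:sum", "7"),
    ]

    def sum_of(addends, triples):
        """Look up the recorded sum for a reified addition with the given addends."""
        additions = {s for s, p, o in triples if p == "instance_of" and o == "addition"}
        for node in additions:
            node_addends = sorted(o for s, p, o in triples if s == node and p == "role:addend")
            if node_addends == sorted(addends):
                return next(o for s, p, o in triples if s == node and p == "role:sum")
        return None

    print(sum_of(["2", "5"], triples))  # -> "7"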
Similarly, without mathematical training, humans display awareness of the concepts of sets, membership, existence, ... without a formal system. The chatbots display this awareness too. It's all vague naive set theory. But how are DNNs modeling set theory? That's a paper someday.
But if we do a good enough job of that, it should then be able to explain to us why it works (after it does some research/science on itself). Yes?
But I do know "Scale is All You Need" is wrong. And VERY wrong.
Scaling has done a lot. Without a doubt it is very useful. But this is a drastic oversimplification of all the work that has happened over the last 10-20 years. ConvNeXt and "ResNets Strike Back" didn't take off, for reasons, despite being very impressive. There have been a lot of algorithmic changes, a lot of changes to training procedures, a lot of changes to how we collect data[0], and more.
We have to be very honest: you can't just buy your way to AGI. There's still innovation that needs to be done. This is great for anyone still looking to get into the space. The game isn't close to being over. I'd argue that this is great for investors too, as there are a lot of techniques looking to try themselves at scale. Your unicorns are going to be over here. A dark horse isn't a horse that just looks like every other horse. It might be a "safer" bet, but that's like betting on amateur jockeys and horses that just train similarly to professional ones. They have to do a lot of catching up, even if the results are fairly certain. At that point you're not investing in the tech, you're investing in the person or the market strategy.
[0] Okay, I'll buy this one as scale if we really want to argue that these changes are about scaling data effectively, but we also look at smaller datasets differently because of these lessons.
I have, multiple times in my career, solved a problem using simple, intelligible models that have empirically outperformed neural models ultimately because there was not enough data for the neural approach to learn anything. As a community we tend to obsess over architecture and then infrastructure, but data is often the real limiting factor.
When I was early in my career I used to always try to apply very general, data-hungry models to all my problems... with very mixed success. As I became more skilled I became a staunch advocate of only using simple models you could understand, with much more successful results (which is what led to this revised opinion). But, at this point in my career, I increasingly see that one's approach to modeling should basically be information-theoretic: try to figure out the model whose channel capacity best matches your information rate.
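As a toy illustration of that capacity-matching point (entirely my own construction, not the commenter's workflow): with few samples a simple ridge regression tends to beat a flexible MLP on the same task, and the gap closes as the data grows.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)

    def make_data(n):
        # Mostly-linear signal with a small nonlinearity and some noise.
        X = rng.uniform(-3, 3, size=(n, 5))
        y = X[:, 0] - 2 * X[:, 1] + 0.5 * np.sin(X[:, 2]) + 0.3 * rng.standard_normal(n)
        return X, y

    for n in (30, 300, 3000):
        X, y = make_data(n)
        simple = cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean()
        flexible = cross_val_score(
            MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
            X, y, cv=5,
        ).mean()
        print(f"n={n:5d}  ridge R^2={simple:.3f}  mlp R^2={flexible:.3f}")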
As a Bayesian, I also think there's a very reasonable explanation for why "The Bitter Lesson" rings true over and over again. In E.T. Jaynes' writing he often talks about Bayes' Theorem in terms of P(D|H) (i.e. the probability of the Data given the Hypothesis, or vice versa), but, especially in the earlier chapters, he purposefully adds an X to that equation: P(D|H,X), where X is a stand-in for all of our prior information about the world. Typically we think of prior information as being literal data, but Jaynes points out that our entire understanding of the world is also part of our prior context.
In this view, models that "leverage human understanding" (i.e. are fully intelligible) are essentially throwing out information at the limit. But to my earlier point, if the data falls quite short of that limit, then those intelligible models are adding information in data-constrained scenarios. I think the challenge in practical application is figuring out the threshold at which you need to adopt a more general approach.
Currently I'm very much in love with Gaussian Processes, which, for constrained data environments, offer a powerful combination of both of these methods. You can give the model prior hints at what things should look like in terms of the relative structure of the kernel and its priors (e.g. there should be some roughly annual seasonal component, and one roughly weekly seasonal component) but otherwise let the data decide.
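For what it's worth, here is a minimal sketch of that kind of kernel-structured GP (library choice, hyperparameters, and the toy data are my assumptions, not the commenter's setup), with one roughly annual and one roughly weekly periodic component plus a smooth trend and noise:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

    # Prior structure: annual seasonality + weekly seasonality + slow trend + noise.
    kernel = (
        1.0 * ExpSineSquared(length_scale=1.0, periodicity=365.0)  # ~annual component
        + 1.0 * ExpSineSquared(length_scale=1.0, periodicity=7.0)  # ~weekly component
        + 1.0 * RBF(length_scale=90.0)                             # smooth long-term trend
        + WhiteKernel(noise_level=1.0)                             # observation noise
    )

    # Toy daily data with two seasonalities, just to show the mechanics of the fit.
    rng = np.random.default_rng(0)
    t = np.arange(0, 730)[:, None]  # two years of days
    y = (np.sin(2 * np.pi * t[:, 0] / 365) + 0.3 * np.sin(2 * np.pi * t[:, 0] / 7)
         + 0.1 * rng.standard_normal(len(t)))

    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(t, y)

    # Posterior mean and uncertainty for the next month; the fitted kernel shows
    # what the data decided about the periodicities and length-scales.
    t_new = np.arange(730, 760)[:, None]
    mean, std = gp.predict(t_new, return_std=True)
    print(gp.kernel_)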
Here's my (maybe a bit loose) recollection of what happened:
Step 1. Stockfish was the typical human-knowledge AI, with tons of actual chess knowledge injected in the process of building an efficient chess engine.
Step 2. Then came Leela Chess Zero, with its Alpha Zero-inspired training, a chess engine trained fully with RL with no prior chess knowledge added. And it has beaten Stockfish. This is a “bitter lesson” moment.
Step 3. The Stockfish devs added a neural network trained with RL to their chess engine, in addition to their existing heuristics. And Stockfish easily took back its crown.
Yes, throwing more compute at a problem is an efficient way to solve it, but if all you have is compute, you'll pretty certainly lose to somebody who has both compute and knowledge.
For AI researchers, the Bitter Lesson is not to rely on supervised learning, not to rely on manual data labeling, nor on manual ontologies nor manual business rules,
Nor on *manually coded* AI systems, except as the bootstrap code.
Unsupervised methods prevail, even if compute expensive.
The challenge from Sutton's Bitter Lesson for AI researchers is to develop sufficient unsupervised methods for learning and AI self-improvement.
https://github.com/official-stockfish/Stockfish/pull/4674
Its evaluation now relies purely on the NNUE neural network.
So it's a good example of the bitter lesson. More compute eventually won against handwritten evaluation. The Stockfish developers thought the old evaluation would help the neural network, so they kept the code for a few years; then it turned out that the NNUE network didn't need any input of human chess knowledge.
> No machine learning model was ever built using pure “human knowledge” — because then it wouldn’t be a learning model. It would be a hard coded algorithm.
I guess the author hasn't heard of expert systems? Systems like MYCIN (https://en.wikipedia.org/wiki/Mycin) were heralded as incredible leaps forward at the time, and they indeed consisted of pure “human knowledge.”
I am disturbed whenever a thinkpiece is written by someone who obviously didn't do their research.
The author doesn't seem to have made up his mind about it. Or maybe the article is AI-generated slop.