The Bitter Lesson is from the perspective of how to spend your entire career. It is correct over the course of a very long time, and bakes in Moore's Law.
The Bitter Lesson is true because general methods capture these assumed hardware gains that specific methods may not. It was never meant for contrasting methods at a specific moment in time. At a specific moment in time you're just describing Explore vs Exploit.
Except in the last year or two, which is why people are citing it a lot :)
Interestingly, this hasn't happened for wafer fabs. A modern wafer fab costs US$1bn to US$3bn, and there is talk of US$20bn wafer fabs. Around the year 2000, those would have been un-financeable. It was expected that fab cost was going to be a constraint on feature size. That didn't happen.
For years, it was thought that the ASML approach to extreme UV was going to cost too much. It's a horrible hack, shooting off droplets of tin to be vaporized by lasers just to generate soft X-rays. Industry people were hoping for small synchrotrons or X-ray lasers or E-beam machines or something sane. But none of those worked out. Progress went on by making a fundamentally awful process work commercially, at insane cost.
Perhaps we will find something better in the future, but for now awful is the best we've got at the cutting edge.
Also, when is the cutting edge not the worst it's ever been?
How much of the recent bitter lesson peddling is done by compute salesmen?
How much of it is done by people who can buy a lot of compute?
DeepSeek was scandalous for a reason.
I think it's quite a bit more likely that HRM scales embarrassingly far and outstrips the tons of RLHF and distillation that have been poured into transformers, more of a bitter lesson 2.0 than anything else.
https://en.wikipedia.org/wiki/Fifth_Generation_Computer_Syst...
However, retrieval is not just Google search. Primary-key lookups in my DB are also retrieval, as are vector index queries or BM25 free-text search queries. It's not a general-purpose area like compute/search. In summary, I don't think that RAG is dead. Context engineering is just like feature engineering: transform the swamp of data into a structured signal that is easy for in-context learning to exploit (toy sketch below).
The corollary of all this is that it's not just about scaling up agents, giving them more LLMs and more data via MCP. The bitter lesson doesn't apply to agents yet.
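To make the retrieval point concrete, here is a toy, purely illustrative sketch (hypothetical data and function names, no real search library): a primary-key lookup, a crude keyword match standing in for BM25, and a nearest-neighbour lookup standing in for a vector index all end up as structured context for the model.

```python
# Toy sketch: three retrieval paths feeding the same prompt context.
# All data and names here are hypothetical illustrations, not a real API.
import math

records = {"order-42": "Order 42: 3 widgets, shipped 2024-05-01",
           "order-43": "Order 43: 1 gadget, payment pending"}

vector_index = [((0.1, 0.9), "Refund policy: 30 days for unopened items"),
                ((0.8, 0.2), "Shipping policy: 5 business days, tracked")]

def by_primary_key(key):
    # Exact lookup, e.g. a row fetched from a database.
    return records[key]

def by_keyword(query):
    # Crude keyword-overlap scoring, standing in for BM25 free-text search.
    terms = set(query.lower().split())
    return max(records.values(),
               key=lambda doc: len(terms & set(doc.lower().split())))

def by_vector(query_vec):
    # Nearest neighbour by cosine similarity, standing in for a vector index.
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        return num / (math.dist(a, (0, 0)) * math.dist(b, (0, 0)))
    return max(vector_index, key=lambda item: cos(query_vec, item[0]))[1]

# "Context engineering": turn the swamp of data into a structured signal.
context = "\n".join([by_primary_key("order-42"),
                     by_keyword("pending gadget payment"),
                     by_vector((0.2, 0.8))])
print(f"Answer using only this context:\n{context}\n\nQ: Which order is pending?")
```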
[0] https://www.tudelft.nl/en/2025/lr/autonomous-drone-from-tu-d...
https://www.nature.com/articles/s41586-023-06419-4
It's similar with options pricing. The most sophisticated models like multivariate stochastic volatility are computationally expensive to approximate with classical approaches (and have no closed form solution), so just training a small NN on the output of a vast number of simulations of the underlying processes ends up producing a more efficient model than traditional approaches. Same with stuff like trinomial trees.
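For flavour, here is a hedged sketch of that surrogate pattern. It deliberately uses plain geometric Brownian motion rather than a multivariate stochastic-volatility model, and scikit-learn's MLPRegressor as the "small NN"; the structure is the point: run many expensive simulations up front, then fit a cheap network that replaces fresh simulations at pricing time.

```python
# Sketch of a NN surrogate trained on Monte Carlo option prices.
# GBM is a simplified stand-in for the stochastic-vol models discussed above.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def mc_call_price(S0, K, sigma, T, r=0.02, n_paths=10_000):
    """Monte Carlo price of a European call (the expensive part)."""
    Z = rng.standard_normal(n_paths)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    return np.exp(-r * T) * np.maximum(ST - K, 0.0).mean()

# Training set: random (spot, strike, vol, maturity) -> simulated price.
X = np.column_stack([rng.uniform(80, 120, 2_000),   # S0
                     rng.uniform(80, 120, 2_000),   # K
                     rng.uniform(0.1, 0.5, 2_000),  # sigma
                     rng.uniform(0.1, 2.0, 2_000)]) # T
y = np.array([mc_call_price(*row) for row in X])

# The surrogate: one cheap forward pass replaces a fresh simulation.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2_000,
                         random_state=0).fit(X, y)
print(surrogate.predict([[100.0, 100.0, 0.2, 1.0]]))  # fast approximate price
```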
Yes, it's called distillation.
Throwing a deep network at a problem without some physical insight into the problem also has its disadvantages, it seems.
The specific scenario was estimating the orientation of a stationary semi-trailer: an objectively measurable quantity, and it was consistently off by 30 degrees, yet I was the jerk for suggesting we move from end-to-end DL to traditional Bar-Shalom techniques.
That scene isn't for me anymore.
Doesn't any such claim come with huge caveats (pre-specified track/course, no random objects flying in between, etc.)? I.e., the train and test distributions are kept the same by ensuring that test time can never be more complicated than the training data.
Also presumably better sensing than raw visual input.
This well-known critical paper shows examples of AI articles/techniques applied to popular datasets with good-looking results, but it also demonstrates that literally a single line of MATLAB code can outperform some of those techniques: https://arxiv.org/pdf/2009.13807
When AlphaZero came along, it blew Stockfish out of the water.
Stockfish is a top engine now because, beyond that initial proof of concept, there's no money to be made by throwing compute at chess.
I think at the moment the best source of data is the chat log, with 1B users and over 1T daily tokens across all LLMs. These chat logs sit at the intersection of human interests and LLM execution errors; they are on-policy for the model, exactly what it needs to improve the next iteration.
- General-purpose algorithms that scale will beat algorithms that aren't that
- The simplest general-purpose, scaling algorithm will win, at least over time
- Neural networks will win
- LLMs will reach AGI with just more resources
This article cites Leela, the chess program, as an example of the Bitter Lesson, as it learns chess using a general method. The article then goes on to cite Stockfish as a counterexample, because it uses human-written heuristics to perform search. However, as you add compute to Stockfish's search, or spend time optimizing compute-expenditure-per-position, Stockfish gets better. Stockfish isn't a counterexample, search is still a part of The Bitter Lesson!
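To make that concrete, here is a bare-bones negamax sketch (game-agnostic; the callables are hypothetical, not Stockfish's actual code): `depth` is the compute knob, and `evaluate` is where the hand-written heuristic, or a small network like NNUE, plugs in. More depth per move, or a cheaper evaluation per node, and the same engine plays stronger.

```python
# Bare-bones negamax. `depth` is the compute knob; `evaluate` is the
# hand-written heuristic (or a small NN such as Stockfish's NNUE).
# The callables passed in are hypothetical, not any real engine's API.
def negamax(position, depth, evaluate, legal_moves, apply_move):
    moves = legal_moves(position)
    if depth == 0 or not moves:
        # `evaluate` scores the position from the side to move's perspective.
        return evaluate(position), None
    best_score, best_move = float("-inf"), None
    for move in moves:
        child = apply_move(position, move)
        score, _ = negamax(child, depth - 1, evaluate, legal_moves, apply_move)
        score = -score  # what is good for the opponent is bad for us
        if score > best_score:
            best_score, best_move = score, move
    return best_score, best_move
```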
>Any one open to that world...
The "world" in question being a brand of Marxism that's super-explicitly anti-human. No, I'm not kidding or exaggerating.
But it's become a lazy crutch for a bunch of people who meet none of those criteria, perverted into a statement more along the lines of "LLMs trained on NVIDIA cards by one of a handful of US companies are guaranteed to outperform every other approach from here to the Singularity".
Nope. Not at all guaranteed, and at the moment? Not even looking likely.
It will have other stuff in it. Maybe that's prediction in representation space like JEPA, maybe it's MCTS like Alpha*, maybe it's some totally new thing.
And maybe it happens in Hangzhou.
Where I furrow my brow is the casual mixing of philosophical conjecture with technical observations or statements. All too often, mixing the two feels like a crutch: neither perspective gets defended on its own, each half of the argument is propped up by the claim that the other half supports it. I know I'm not articulating my point well here, but it just comes off as a little... insincere, I guess? I'm sure someone here will find the appropriate words to communicate my point better, if I'm being understood.
One nitpick on the philosophical side of things I'd point out is that a lot of the resistance to AI replacing human labor is less to do with the self-styled importance of humanity, and more the bleak future of a species where a handful of Capitalists will destroy civilization for the remainder to benefit themselves. That is what sticks in our collective craw, and a large reason for the pushback against AI - and nobody in a position of power is taking that threat remotely seriously, largely because the owners of AI have a vested interest in preventing that from being addressed (since it would inevitably curb the very power they're investing in building for themselves).
Arguably, so is the alternative: explicitly embedding knowledge!
Nothing is immune to GIGO.
Only tangentially related, but this has to be one of the worst metaphors I’ve ever heard. Garbage cans are not typically hotbeds of chaotic activity, unless a raccoon gets in or something.
43% of American workers have used AI at work, mostly in informal ways, solving their own work problems. Scaling AI across the enterprise is hard.
A lot of firms getting into this business are "betting the farm" on "scaling AI across the enterprise".

In my experience LLMs are incredibly useful from a simple text interface (I only work with text, mainly computer code). I am still reeling from how disruptive they are in that context.
But IMO there is not a lot of money to be made for startups in that context (I expect there is not enough to justify the high valuations of outfits like OpenAI). There should be a name for the curse: revolutionary technology that makes many people vastly more productive, but with no real way to capture that value. Unless "scaling AI across the enterprise" can succeed.
I have my doubts. I am sure there will be niches, and in a decade or so, with hindsight, it will be clear what they are. But there is no reliable way to tell now.
The "Bitter Lesson" seems like a distraction to me. The fundamental problem is related: this technology is generally useful, much more than it is specifically useful.
The "killer app" is a browser window open to https://chat.deepseek.com. There is not much beyond that. Not nothing, just not much.
But so long as you have not bet your farm on "scaling AI across the enterprise" nor been fired by someone else who is trying, we should be very happy. We are in a "steam engine" moment. Nothing will ever be the same.
And if OpenAI and the like all go belly up and demote a swathe of billionaires to being merely normally rich, that is the cherry on top.
Then, as the article mentions, some new fundamental shift happens, and practitioners need to jump over to a completely new way of working. Monkeypatching to make it all work. Rinse repeat.
Be careful when anyone, even a giant in the field such as Sutton, posits a sweeping claim like this.
My take? Sutton's "bitter lesson" is rather vague and unspecified (i.e. hard to pin down and test) for at least two reasons:
1. The word "ultimately" is squishy, when you think about it. When has enough time passed to make the assessment? At what point can we say e.g. "Problem X has a most effective solution"?
2. What do we mean by "most effective"? There is a lot of variation, including but not limited to (a) some performance metric; (b) data efficiency; (c) flexibility / adaptability across different domains; and (d) energy efficiency.
I'm a big fan of Sutton's work. I've read his RL book cover-to-cover and got to meet him briefly. But, to me, the bitter lesson (as articulated in Sutton's original post) is not even wrong. It is sufficiently open-ended that many of us will disagree about what the lesson is, even before we can get to the empirical questions of "First, has it happened in domain D at time T? Second, is it 'settled' now, or might things change?"
Thirty-five years ago they gave me a Ph.D. basically for pointing out that the controversy du jour -- reactive vs deliberative control for autonomous robots -- was not a dichotomy. You could have the best of both worlds by combining a reactive system with a deliberative one. The reactive system interfaced directly to the hardware on one end and provided essentially a high-level API on the other end that provided primitives like "go that way". It's a little bit more complicated than that because it turns out you need a glue layer in the middle, but the point is: you don't have to choose. The Bitter Lesson is simply a corollary of Ron's First Law: all extreme positions are wrong. So reactive control by itself has limits, and deliberative control by itself has limits. But put the two together (and add some pretty snazzy image processing) and the result is Waymo.
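For a rough picture of that layering (class names are hypothetical, not from the thesis or any real robot stack), a minimal sketch: the reactive layer talks to the hardware and exposes primitives like "go that way", the deliberative layer plans over a map, and the glue layer in the middle translates between them.

```python
# Toy sketch of the layered architecture described above; names and numbers
# are hypothetical illustrations.
class ReactiveLayer:
    """Talks to the hardware; exposes primitives like 'go that way'."""
    def go_toward(self, heading, obstacle_sensed):
        if obstacle_sensed:
            return {"left": 0.2, "right": 0.8}   # swerve reflexively
        return {"left": 0.8, "right": 0.8}       # drive along the heading

class DeliberativeLayer:
    """Plans over a map; knows nothing about motors or sensors."""
    def next_waypoint(self, position, goal):
        # In a real system: A*, PRM, etc. Here: head straight at the goal.
        return goal

class GlueLayer:
    """The middle layer: turns planned waypoints into reactive-layer requests."""
    def __init__(self, reactive, deliberative):
        self.reactive, self.deliberative = reactive, deliberative
    def step(self, position, goal, obstacle_sensed):
        waypoint = self.deliberative.next_waypoint(position, goal)
        heading = (waypoint[0] - position[0], waypoint[1] - position[1])
        return self.reactive.go_toward(heading, obstacle_sensed)

robot = GlueLayer(ReactiveLayer(), DeliberativeLayer())
print(robot.step(position=(0, 0), goal=(10, 5), obstacle_sensed=False))
```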
So it was no surprise to me that Stockfish, with its similar approach of combining deliberative search with a small NN computing its quality metric, blows everything else out of the water. It has been obvious (at least to me) that this is the right approach for decades now.
I'm actually pretty terrified of the results when the mainstream AI companies finally rediscover this. The capabilities of LLMs are already pretty impressive on their own. If they can get a Stockfish-level boost by combining them with a simple search algorithm, the result may very well be the AGI that the rationalist community has been sounding the alarm over for the last 20 years.
This is about AI, the title is ambiguous.
"Despite" was used in an unambiguously wrong way.