Discount factors (γ), n-step return horizons, the λ in GAE - these are human priors about temporal abstraction baked directly into the learning signal. PPO's GAE(λ) literally tells the algorithm "here's how far into the future you should care about consequences": the effective credit-assignment horizon is roughly 1/(1 - γλ). We're not learning this, we're imposing it. Different domains need different γ and λ values. That's manual feature engineering, RL-style.
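To make the point concrete, here's a minimal NumPy sketch of GAE in its standard form (the function name and defaults are mine, not from any particular library). Notice that γ and λ are just arguments somebody has to pick:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    gamma and lam are fixed human priors: together they set the effective
    credit-assignment horizon of the advantage estimator, roughly
    1 / (1 - gamma * lam).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]  # TD residual
        gae = delta + gamma * lam * nonterminal * gae  # exponentially weighted sum of residuals
        advantages[t] = gae
    return advantages

# `values` carries one extra entry: the bootstrap value of the state after the
# last step. With the defaults gamma=0.99, lam=0.95 the estimator trusts
# roughly 1 / (1 - 0.9405) ≈ 17 steps of credit assignment: a number a human
# picked, not something the agent learned.
```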
Biological learning doesn't have a global discount factor slider. Dopamine and temporal difference learning in the brain operate at multiple timescales simultaneously - the brain learns which timescales matter for which situations. Our algorithms? They get a single γ parameter tuned by grad students.
Even worse: exploration strategies are domain-specific hacks. ε-greedy for Atari, Ornstein-Uhlenbeck or Gaussian action noise for continuous-control robotics, count-based bonuses for sparse rewards. We're essentially doing "exploration engineering" for each domain, like it's 2012 computer vision all over again.
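Here's what that per-domain exploration engineering looks like in practice - a rough sketch where the function names, defaults, and coefficients are illustrative, not taken from any specific codebase:

```python
import numpy as np

# Three domains, three hand-built exploration mechanisms.

def epsilon_greedy(q_values, epsilon=0.05):
    """Discrete control (Atari-style): act randomly epsilon of the time."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def noisy_continuous_action(action, sigma=0.2, low=-1.0, high=1.0):
    """Continuous control (robotics-style): jitter the policy's action with Gaussian noise."""
    return np.clip(action + np.random.normal(0.0, sigma, size=np.shape(action)), low, high)

def count_based_reward(extrinsic_reward, visit_count, beta=0.01):
    """Sparse-reward domains: add an intrinsic bonus for rarely visited states."""
    return extrinsic_reward + beta / np.sqrt(visit_count + 1)

# epsilon, sigma, and beta are all coefficients someone sweeps per task:
# exploration engineering, not exploration learning.
```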
Compare this to supervised learning circa 2015: we stopped engineering features and just scaled deep networks (and, soon after, transformers). The architecture learned what mattered. RL in 2025? Still tweaking γ, λ, exploration coefficients, and entropy bonuses for every new task.
True bitter-lesson compliance would mean learning your own temporal abstractions (dynamic γ), learning how to explore (meta-RL over exploration strategies), and learning credit-assignment windows (adaptive eligibility traces). Some promising directions exist - the options framework, meta-RL, world models with learned abstractions - but they're not mainstream because they're compute-hungry and unstable. We keep returning to human priors because they're cheaper. That's the opposite of the bitter lesson.
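For a flavor of what "dynamic γ" could look like, here's a simplified PyTorch sketch loosely in the spirit of meta-gradient RL: γ becomes a learnable parameter updated against held-out TD error instead of a constant someone picked. A real meta-gradient method would differentiate through the inner value-function update; this sketch only uses the direct gradient of the held-out loss with respect to γ, so treat it as an illustration of the idea, not a reference implementation.

```python
import torch

# Sketch: a learned discount factor ("dynamic gamma").

value_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
gamma_logit = torch.tensor(2.0, requires_grad=True)  # sigmoid(2.0) ~= 0.88 to start

inner_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
meta_opt = torch.optim.Adam([gamma_logit], lr=1e-4)

def td_loss(batch, gamma):
    states, rewards, next_states, dones = batch
    v = value_net(states).squeeze(-1)
    with torch.no_grad():
        v_next = value_net(next_states).squeeze(-1)
    targets = rewards + gamma * (1.0 - dones) * v_next
    return ((targets - v) ** 2).mean()

def meta_step(train_batch, held_out_batch):
    # Inner step: fit the value function under the current gamma.
    inner_opt.zero_grad()
    td_loss(train_batch, torch.sigmoid(gamma_logit).detach()).backward()
    inner_opt.step()
    # Meta step: nudge gamma so held-out TD error shrinks, making the horizon
    # itself a learned quantity instead of a fixed prior.
    meta_opt.zero_grad()
    td_loss(held_out_batch, torch.sigmoid(gamma_logit)).backward()
    meta_opt.step()

def random_batch(n=32):
    # Stand-in data so the sketch runs end to end.
    return torch.randn(n, 4), torch.randn(n), torch.randn(n, 4), torch.zeros(n)

for _ in range(100):
    meta_step(random_batch(), random_batch())
print("learned gamma:", torch.sigmoid(gamma_logit).item())
```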
The irony is stark: RL researchers talk about "end-to-end learning" while manually tuning the most fundamental learning signal parameters. Imagine if vision researchers were still manually setting feature detector orientations in 2025. That's where RL is.
I predict: The next major RL breakthrough won't come from better policy gradient estimators. It'll come from algorithms that discover their own temporal abstractions and exploration strategies through meta-learning at scale. Only then will RL be bitter-lesson-pilled.