I like this, and I think it's true of how humans learn. What's interesting to me is that LLMs seem significantly smarter than they were two years ago, but it doesn't feel like they have better "taste". Their failure modes are still bizarre and inhuman. I wonder what it is about their architecture/training that scales their experience without corresponding improvements in taste.
supermdguy•1h ago
In theory, RLVR (reinforcement learning with verifiable rewards) should encourage less error-prone code, similar to a human getting burned by production outages, as the article mentions. Maybe the training scale just isn't big enough for that to matter? Perhaps we need better benchmarks that capture the long-term issues that arise from bad models and unnecessary complexity.
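To make that gap concrete, here's a minimal sketch (my own illustration, not anything from the article): a pass/fail verifiable reward of the kind RLVR typically uses, plus a crude structural penalty standing in for "taste". Everything here is hypothetical, including the rlvr_reward and complexity_penalty functions and both clamp candidates.

    import ast

    def rlvr_reward(solution_code: str, test_code: str) -> float:
        """Binary verifiable reward: 1.0 if the candidate passes its tests,
        0.0 otherwise -- roughly all an RLVR-style signal observes."""
        namespace = {}
        try:
            exec(solution_code, namespace)  # define the candidate function(s)
            exec(test_code, namespace)      # asserts raise on failure
            return 1.0
        except Exception:
            return 0.0

    def complexity_penalty(solution_code: str) -> float:
        """A crude stand-in for the missing 'taste' signal: count branching
        nodes (the core of cyclomatic complexity) that pass/fail never sees."""
        tree = ast.parse(solution_code)
        branches = sum(
            isinstance(node, (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp))
            for node in ast.walk(tree)
        )
        return 0.01 * branches

    # Two hypothetical candidates that both satisfy the same tests.
    simple = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
    baroque = (
        "def clamp(x, lo, hi):\n"
        "    if x < lo:\n"
        "        if True:\n"      # needless nesting a reviewer would flag
        "            return lo\n"
        "    if x > hi:\n"
        "        return hi\n"
        "    return x\n"
    )
    tests = (
        "assert clamp(5, 0, 3) == 3\n"
        "assert clamp(-1, 0, 3) == 0\n"
        "assert clamp(2, 0, 3) == 2\n"
    )

    for name, code in [("simple", simple), ("baroque", baroque)]:
        # Identical verifiable reward, very different structure.
        print(name, rlvr_reward(code, tests), complexity_penalty(code))

Both candidates earn an identical reward of 1.0, so a binary pass/fail signal has no gradient pushing the model toward the simpler version; that pressure would have to come from a benchmark that also scores structure, which is the kind of long-horizon signal suggested above.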