I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands of RL tasks (without any proof whatsoever, so it's more of a feeling). So I've been wanting an "RL Zoo" for quite a while. I hope this project won't be a one-off and will be maintained long term, with many external contributions adding new targets!
Given that GDM pioneered RL, that's a reasonable assumption
RL was established, at the latest, with Q-learning in 1989: https://en.wikipedia.org/wiki/Q-learning
i still think my original statement is fair
people who already knew your statement wasn't strictly accurate would understand what you meant from context and agree on vibes. people who didn't could reasonably be misled, i think.
gemini 2.5 pro shines for 200k+ tokens
does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?
Their results are consistent with novel reasoning strategies, but they're also consistent with more reliable execution of reasoning strategies that the base model can generate in principle, but rarely succeeds at due to a large number of steps. (If you have a model that can do each step independently with 99% success rate and getting the correct result requires 1000 steps, the chance of making it all the way to the end without a single error is only about 0.004%.)
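The arithmetic behind that number, for anyone who wants to check it:

```python
# Probability of finishing an N-step chain when every step independently
# succeeds with probability p: p ** N.
p, n = 0.99, 1000
print(p ** n)   # ~4.3e-05, i.e. roughly 0.004%
```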
I agree with your general point though, i.e. we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".
For a novel reasoning strategy, I would expect at least a few individual tokens where the base model assigns much smaller probabilities than the reinforcement-learning trained one, as opposed to just being a little smaller but spread out over many tokens. (Which would better fit a "death by a thousand cuts" scenario.)
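A rough sketch of the kind of check I mean, assuming an HF-transformers-style setup with placeholder model IDs (and assuming both models share a tokenizer):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID, TUNED_ID = "base-model-id", "rl-tuned-model-id"  # placeholders

tok = AutoTokenizer.from_pretrained(TUNED_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID)

def token_logprobs(model, ids):
    # log P(token_t | tokens_<t) at every position of a given trace
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).squeeze(-1)

# Score the *same* RL-model-generated reasoning trace under both models.
ids = tok("<an RL-model-generated reasoning trace>", return_tensors="pt").input_ids
gap = token_logprobs(tuned, ids) - token_logprobs(base, ids)
# A few large spikes would point to genuinely new behaviour; a small,
# uniform gap over many tokens is the "death by a thousand cuts" picture.
print(gap.topk(min(5, gap.shape[-1])).values)
```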
> Spurious Rewards: Rethinking Training Signals in RLVR
>
> *TL;DR*: We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains.
>
> All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:
> - RLVR + format reward (reward responses with `\boxed{}`): *+16.4%*
> - RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
> - RLVR + random reward: *+21.4%*
> - (as a reference) RLVR + ground-truth reward: *+28.8%*
> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards? [1]
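For concreteness, here is roughly what those reward signals look like as code; a hypothetical sketch, not the paper's implementation:

```python
import random
import re

# Hypothetical sketches of the reward signals listed above (not the
# paper's code). Each maps a sampled completion (plus the gold answer,
# where relevant) to a scalar reward for RLVR training.

BOXED = re.compile(r"\\boxed\{(.+?)\}")

def format_reward(completion: str) -> float:
    # Reward any response containing \boxed{...}, right or wrong.
    return 1.0 if BOXED.search(completion) else 0.0

def incorrect_reward(completion: str, gold: str) -> float:
    # Reward only answers that do NOT match the ground truth.
    m = BOXED.search(completion)
    return 1.0 if m and m.group(1).strip() != gold.strip() else 0.0

def random_reward(completion: str) -> float:
    # Ignore the completion entirely.
    return float(random.random() < 0.5)

def ground_truth_reward(completion: str, gold: str) -> float:
    # The standard RLVR signal, as the reference point.
    m = BOXED.search(completion)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0
```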
>Learning to Reason without External Rewards Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]
[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...
[2] https://arxiv.org/abs/2505.19590
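My reading of "self-certainty" (a guess at one plausible instantiation, not necessarily the paper's exact formula) is the average divergence of each next-token distribution from uniform, i.e. log V minus the token entropy:

```python
import math

import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: [seq_len, vocab_size] for the tokens the model generated.
    # KL(p_t || Uniform) = log(V) - H(p_t); averaging over the sequence
    # gives one scalar "confidence" per completion, which could replace
    # the verifier reward when ranking a group of samples GRPO-style.
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # H(p_t) per position
    return (math.log(logits.shape[-1]) - entropy).mean()
```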
> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
it's because in those cases, RLVR merely elicits the reasoning strategies already contained in the model through pre-training
this paper, which uses Reasoning Gym, shows that you need to train for far longer than the papers you mentioned did to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864
So the reward-value shifting may act as a sort of unintentional regularization technique (similar to adding noise to the discriminator input in GAN architectures).
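i.e. something loosely like this; my illustration of the analogy, not anything the papers above describe doing deliberately:

```python
import torch

def perturbed_rewards(rewards: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Analogue of instance noise on a GAN discriminator's input: jitter the
    # reward before advantages are computed, so the policy can't latch onto
    # any single reward value too hard.
    return rewards + sigma * torch.randn_like(rewards)
```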
Prejudice is a form of overfitting, IMHO.
Part of the aim of RG is to be used as a difficulty-adjustable & non-repeating eval, though, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!
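A toy illustration of the "difficulty-adjustable & non-repeating" idea (not RG's actual API, just the concept):

```python
import random

def make_arithmetic_task(difficulty: int, seed: int) -> dict:
    # Procedural generation: higher difficulty -> more and larger operands;
    # a fresh seed -> an item that has never appeared anywhere before, so
    # the eval can't be memorized and can be made harder as models improve.
    rng = random.Random(seed)
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 2)]
    return {"question": f"What is {' + '.join(map(str, terms))}?",
            "answer": str(sum(terms))}

task = make_arithmetic_task(difficulty=3, seed=12345)
# score a model by exact match against task["answer"]
```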