Funny how this title follows Betteridge's law of headlines. In this case it demonstrates that RLVR (Reinforcement Learning with Verifiable Rewards) doesn't help the model generalize, but rather seems to overfit it, reducing its overall reasoning capacity.
yorwba•2h ago