That's just natural reward hacking when you have no training/constraints for readability. IIRC R1 Zero is like that too, they retrained it with a bit of SFT to keep it readable and called it R1. Hallucinating training examples if you break the format or prompt it with nothing is also pretty standard behavior.
flabber•8h ago
striking•2h ago
k310•27m ago
mac-attack•1h ago