I wrote a comprehensive GRPO tutorial that walks you through coding your own GRPO trainer. Besides the code, I go through the math and highlight some interesting properties of the GRPO loss, such as why the loss is always 0 at step 0 and why the policy loss is always 0 when the number of iterations (mu) hyperparameter is set to 1. I'd appreciate any feedback!
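To make the zero-loss claim concrete, here is a minimal pure-Python sketch. It is not the tutorial's code. It assumes rewards are normalized within one group, the clipped surrogate is averaged per completion and then over the group, and the KL penalty uses the common `exp(d) - d - 1` estimator. The function names, `eps`, `clip_eps`, `beta`, and the toy numbers are all made up for illustration.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def policy_loss(logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate, averaged per completion, then over the group."""
    per_completion = []
    for lp_seq, old_seq, adv in zip(logps, old_logps, advantages):
        terms = []
        for lp, old in zip(lp_seq, old_seq):
            ratio = math.exp(lp - old)  # equals 1.0 when lp == old
            clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
            terms.append(min(ratio * adv, clipped * adv))
        per_completion.append(sum(terms) / len(terms))
    return -sum(per_completion) / len(per_completion)


def kl_penalty(logps, ref_logps):
    """Per-token estimator exp(d) - d - 1 with d = ref - lp, averaged."""
    per_completion = []
    for lp_seq, ref_seq in zip(logps, ref_logps):
        terms = [math.exp(r - lp) - (r - lp) - 1 for lp, r in zip(lp_seq, ref_seq)]
        per_completion.append(sum(terms) / len(terms))
    return sum(per_completion) / len(per_completion)


# Toy group of 4 completions with different lengths (hypothetical numbers).
rewards = [1.0, 0.0, 0.5, 0.0]
logps = [[-0.7, -1.2], [-0.3], [-2.0, -0.1, -0.4], [-0.9, -0.9]]

advantages = group_advantages(rewards)

# mu = 1: the batch is used once, so old_logps == logps and every ratio is 1.
# Step 0: the policy still equals the reference model, so the KL term is 0.
beta = 0.04  # hypothetical KL coefficient
loss = policy_loss(logps, logps, advantages) + beta * kl_penalty(logps, logps)
print(abs(loss) < 1e-9)
```

With every ratio equal to 1, each completion contributes just its advantage, and group-normalized advantages sum to zero, so the loss value vanishes up to floating-point error. Only the value is zero: the gradient generally is not, since it weights each token's log-probability gradient by the advantage. Under a different reduction, such as averaging over all tokens in the batch, the value need not be exactly zero.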
Pramodith•3h ago