I'm a software engineer documenting my journey of learning RL research from scratch. This post was supposed to be a straightforward story about switching my PPO agent from an MLP to a CNN.
The switch, combined with a standard PPO trick, led to a shocking result: my agent's score jumped from 15 to 84, crushing the baseline. I thought I had cracked it.
But after digging into the training dynamics, I discovered that the incredible performance had been the work of a subtle bug in my advantage calculation all along. Fixing the bug tanked the score right back down to mediocrity.
The post is the full detective story of that discovery, the "false victory," and the new mystery that it sets up: why was the bug so helpful? That's the question I'll be tackling next.
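For anyone who wants concrete context for "advantage calculation": below is a minimal sketch of a standard GAE computation in JAX. It's the textbook formulation rather than a copy of my repo, so the names (compute_gae, rewards, values, dones, last_value) and the done-flag convention are illustrative, not what my code necessarily looks like.

    import jax
    import jax.numpy as jnp

    def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
        # Generalized Advantage Estimation over one rollout of length T.
        # rewards, values, dones: shape (T,). dones[t] == 1.0 if the episode
        # ended on step t (so the bootstrap from s_{t+1} must be masked out).
        # last_value: scalar bootstrap value for the state after the final step.
        next_values = jnp.append(values[1:], last_value)
        # TD residuals, with the bootstrap zeroed at episode boundaries.
        deltas = rewards + gamma * next_values * (1.0 - dones) - values

        def step(carry, xs):
            delta, done = xs
            gae = delta + gamma * lam * (1.0 - done) * carry
            return gae, gae

        # Accumulate discounted residuals backwards in time.
        init = jnp.zeros((), dtype=deltas.dtype)
        _, advantages = jax.lax.scan(step, init, (deltas, dones), reverse=True)
        returns = advantages + values  # regression targets for the value head
        return advantages, returns

The reverse-time scan plus the (1 - done) masking is where most of the fiddly bookkeeping lives; the specific mistake I made is spelled out in the post.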
Happy to answer any questions about the JAX/Flax implementation or the debugging process!