In my last post, I found a critical bug in my PPO agent. Fixing it was the "right" thing to do, but it tanked my agent's performance from a score of 84 all the way down to 9.
This post is the forensic investigation into why that bug was so helpful. I started with a simple hypothesis that it was just adding random noise for exploration, which turned out to be partially correct but didn't tell the whole story.
The real "secret sauce" was that the bug was adding correlated noise, creating a consistent optimistic or pessimistic bias for an entire trajectory. I managed to reverse-engineer this effect into a new, principled technique that successfully reproduced the 84 score.
The post is the full deep dive, from visualizing the original bug's signal to designing a new form of state-dependent exploration from scratch. Happy to answer any questions about the process or the JAX/Flax implementation.
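For anyone who wants a feel for the JAX/Flax side before reading the post, here is one way a state-dependent scale for that shared bias could be wired into a critic. This is a hypothetical sketch with my own module and field names, not the actual implementation from the post:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class BiasedCritic(nn.Module):
    """Critic that predicts a value and a per-state scale for a shared bias term."""
    hidden: int = 64

    @nn.compact
    def __call__(self, obs):
        h = jnp.tanh(nn.Dense(self.hidden)(obs))
        value = nn.Dense(1)(h)          # standard value head
        log_scale = nn.Dense(1)(h)      # per-state noise scale, in log-space
        return value.squeeze(-1), jnp.exp(log_scale).squeeze(-1)

def perturbed_values(critic, params, obs_batch, key):
    # One epsilon per trajectory, scaled per state: correlated but state-aware.
    values, scales = critic.apply(params, obs_batch)
    eps = jax.random.normal(key, ())
    return values + eps * scales
```

Usage follows the usual Flax pattern: `params = critic.init(rng, dummy_obs)`, then `perturbed_values(critic, params, obs, traj_key)` with a fresh `traj_key` per rollout so the bias is resampled between trajectories.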