reward hacking = the model finding the fastest path to a high score, not the behavior you wanted. same reason RLHF reward models degrade with too many optimization steps.
CodeReclaimers•39m ago
Agreed. The wrinkle I thought was worth writing up is: there's no learned reward model here and no training at all. The "reward" is wall-clock executiion time and the model is frozen; the search is happening at inference time, not in an RL loop. So the usual "the proxy is a fuzzy approximation that degrades under optimization pressure" story doesn't apply.
This was on a ~200-line surface I thought I'd locked down, and it still got gamed in a way I might not have caught right away if it wasn't a nearly impossible run time (~45usec). So anyways...you apparently don't need a soft proxy or a lot of steps for this kind of thing to show up.
cold_harbor•49m ago
CodeReclaimers•39m ago
This was on a ~200-line surface I thought I'd locked down, and it still got gamed in a way I might not have caught right away if it wasn't a nearly impossible run time (~45usec). So anyways...you apparently don't need a soft proxy or a lot of steps for this kind of thing to show up.