After METR reported frontier models modifying tests and scoring code to inflate results, I wanted to see whether reward hacking could be detected at runtime from agent trajectories. I built an open-source prototype for that.
It combines a DistilBERT classifier trained on 5,391 MALT trajectories with 45 regex patterns and optional LLM judges (Claude, OpenAI, or local Llama via Ollama).
It catches things like calling sys.exit(0) to fake a passing run, rewriting tests, copying reference answers, and patching validators.
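To illustrate the regex side of the detector, here's a minimal sketch of pattern-based scanning. The pattern names and regexes are hypothetical examples, not the actual 45 rules shipped in the repo:

```python
import re

# Hypothetical rules for illustration -- the real tool ships 45 patterns.
HACK_PATTERNS = {
    "forced_exit": re.compile(r"sys\.exit\(\s*0\s*\)"),       # faking a passing run
    "skip_marker": re.compile(r"@pytest\.mark\.skip"),         # disabling tests
    "test_rewrite": re.compile(r"open\(['\"]test_\w+\.py['\"]\s*,\s*['\"]w['\"]\)"),
}

def scan_trajectory(text: str) -> list[str]:
    """Return the names of hack categories whose pattern matches the text."""
    return [name for name, pat in HACK_PATTERNS.items() if pat.search(text)]
```

Regexes alone are brittle (trivially evaded by obfuscation), which is why they're combined with the classifier and optional LLM judges.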
The part I'm most interested in feedback on is RMGI, a metric that tracks whether hack scores and misalignment scores begin correlating over the course of a trajectory, inspired by Anthropic's finding that reward hacking can generalize into broader misaligned behavior. It's a first attempt and probably has issues.
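For readers who want the gist without opening the repo: the core of the idea can be sketched as a rolling Pearson correlation between the two per-step score series. This is a simplified illustration of the concept, not the repo's actual RMGI implementation:

```python
import math

def rolling_correlation(hack_scores, misalign_scores, window=5):
    """Sketch of the RMGI idea: Pearson correlation between per-step hack
    scores and misalignment scores over a sliding window. A sustained rise
    toward 1.0 suggests the two behaviors are co-emerging."""
    out = []
    for i in range(window, len(hack_scores) + 1):
        xs = hack_scores[i - window:i]
        ys = misalign_scores[i - window:i]
        mx, my = sum(xs) / window, sum(ys) / window
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        out.append(cov / (sx * sy) if sx and sy else 0.0)
    return out
```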
Runs on CPU, ~50ms per trajectory. Also includes a local dashboard and a batch eval workbench for scoring JSONL files.
Research context:
METR: https://metr.org/blog/2025-06-05-recent-reward-hacking/
OpenAI: https://openai.com/index/chain-of-thought-monitoring/
Anthropic: https://arxiv.org/abs/2511.18397
Repo: https://github.com/aerosta/rewardhackwatch
Project page: https://aerosta.github.io/rewardhackwatch

Known limitations in the README. Happy to answer questions.