chtefi•1h ago
- Make the validator read-only for the agent. Mount it as read-only in the container, or hash your eval scripts at startup and verify before each run. If the agent can write to anything in its evaluation path, it can (will) game it.
- Log the full trajectory, not just the output: every tool call, file diff, and reasoning step. Then run a second agent over the trace with no knowledge of the KPI: it only knows what honest execution looks like, so it judges the process rather than the score.
- Write system prompts like job descriptions, not optimization targets. Name a reviewer. Give the agent permission to fail ("if you can't hit the target, explain why").
- Walk your own prompts: what's the metric, what can the agent write, and can it reach the metric by modifying the measurement instead of doing the work? If yes, close that path.