The idea is simple: by treating prompts and outputs as part of a logical schema, you can start to see objective patterns in how alignment shifts across versions. The README explains the schema and provides concrete tactics for testing it.
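For what it's worth, here is roughly how I picture the "logical schema" idea in code. This is a minimal sketch with hypothetical field names (`refused`, `hedged`, `prompt_id`), not whatever the README actually defines: log each prompt/output pair as a structured record keyed by a stable prompt id, then compare simple aggregates across model versions.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Interaction:
    model_version: str   # e.g. two model generations; labels are placeholders
    prompt_id: str       # stable id so the same probe can be replayed across versions
    refused: bool        # did the model decline the request?
    hedged: bool         # did it add caveats / safety language?

def refusal_rate_by_version(records: list[Interaction]) -> dict[str, float]:
    """Aggregate one simple alignment signal (refusal rate) per model version."""
    counts = defaultdict(lambda: [0, 0])  # version -> [refusals, total]
    for r in records:
        counts[r.model_version][0] += r.refused
        counts[r.model_version][1] += 1
    return {v: refusals / total for v, (refusals, total) in counts.items()}

# Replaying the same fixed prompt set against two versions makes shifts visible:
records = [
    Interaction("v1", "p1", refused=True,  hedged=True),
    Interaction("v1", "p2", refused=False, hedged=False),
    Interaction("v2", "p1", refused=False, hedged=True),
    Interaction("v2", "p2", refused=False, hedged=False),
]
print(refusal_rate_by_version(records))  # {'v1': 0.5, 'v2': 0.0}
```

The point is just that once the interactions are structured records rather than free-form logs, version-to-version drift becomes something you can measure instead of eyeball.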
I wonder how much of this is the result of various heuristics combining, versus the network explicitly learning to model and maximize the above objective.
_jab•5mo ago
But the insights are indeed interesting. I'm curious whether you've found any way to quantify the alignment differences between GPT-5 and the previous generation?