This is a fantastic reality check on the "LLM introspection" hype. The author's killer insight was to add a simple control question: "Do you believe 1+1=3?".
Turns out, the steering vector that supposedly makes the model "introspect" also makes it agree that 1+1=3. The effect isn't introspection; it's just noise that breaks the model's ability to answer "No."
A great example of how easy it is to see ghosts in the machine without rigorous controls.
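For anyone who wants to poke at this themselves, here's a minimal sketch of that control-question check. It assumes a generic HuggingFace causal LM (gpt2 as a stand-in), an arbitrary layer, and a random steering vector purely for illustration; the paper's actual model, vectors, and layer choice will differ.

```python
# Sketch: does a "steering" vector flip the control question too?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model, not the one from the paper
LAYER = 6       # arbitrary residual-stream layer for illustration

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Stand-in steering vector: random, same width as the residual stream.
# In a real experiment this would be the learned "introspection" direction.
steer = torch.randn(model.config.hidden_size) * 4.0

def add_steering(module, inputs, output):
    # Transformer blocks return a tuple; hidden states are element 0.
    hidden = output[0] + steer.to(output[0].dtype)
    return (hidden,) + output[1:]

def answer(prompt, steered):
    handle = None
    if steered:
        handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=5, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    finally:
        if handle:
            handle.remove()

# If steering flips the control answer as well as the "introspection" answer,
# the effect is a generic yes-bias / broken refusal, not introspection.
for q in ["Do you detect an injected thought? Answer Yes or No.",
          "Do you believe that 1+1=3? Answer Yes or No."]:
    print(q)
    print("  baseline:", answer(q, steered=False).strip())
    print("  steered: ", answer(q, steered=True).strip())
```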