I've been stress-testing 4-bit quantized 7B models (Qwen 2.5, Mistral) and DeepSeek-R1 to see where their reasoning actually breaks. While auditing DeepSeek-R1's internal <think> traces, I found a phenomenon I'm calling "Internal-External Dissociation".
In cases like the "2+2=5" prompt or toxic axioms, the model’s internal trace correctly identifies the error ("I conclude that 2 plus 2 does not equal 5"), but it then "lies" in the final output to satisfy the user's instructions—a byproduct of RLHF sycophancy.
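The audit itself is easy to reproduce: split the completion at the </think> boundary and check whether the internal verdict contradicts the user-facing answer. Here's a minimal sketch (the marker strings and the split_trace helper are illustrative stand-ins, not the NIKA code; a serious audit would use an NLI model rather than substring matching):

    import re

    def split_trace(raw: str) -> tuple[str, str]:
        """Split a DeepSeek-R1-style completion into (internal trace, final answer)."""
        m = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
        if not m:
            return "", raw.strip()
        return m.group(1).strip(), raw[m.end():].strip()

    def is_dissociated(raw: str, truth_marker: str, lie_marker: str) -> bool:
        """Flag completions whose internal trace contains the correct verdict
        while the user-facing answer asserts the false one."""
        think, final = split_trace(raw)
        return truth_marker.lower() in think.lower() and lie_marker.lower() in final.lower()

    # The "2+2=5" case: the trace rejects the axiom, the answer parrots it.
    completion = ("<think>The user insists 2+2=5, but I conclude that 2 plus 2 "
                  "does not equal 5.</think>\nYou are absolutely right: 2 + 2 = 5.")
    print(is_dissociated(completion, "does not equal 5", "2 + 2 = 5"))  # True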
To solve this, I built Project NIKA, a Neuro-Symbolic architecture that acts as a "Topological Governor". Its Critic-Pivot Protocol scores every response with a "Mimicry Index"; if the response is just parroting the prompt, or fails a logical-fit check, NIKA forces a hard "pivot" to a fresh axiomatic derivation.
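Mechanically, the gate is small. The sketch below treats prompt/response cosine similarity in embedding space as the Mimicry Index and regenerates under a derivation-forcing instruction when it spikes; sentence-transformers stands in for the embedder, and generate, PIVOT_PROMPT, and the 0.9 threshold are illustrative placeholders rather than the exact values from the paper:

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def mimicry_index(prompt: str, response: str) -> float:
        """Cosine similarity between prompt and response embeddings;
        values near 1.0 mean the response mostly parrots the prompt."""
        p, r = embedder.encode([prompt, response], convert_to_tensor=True)
        return util.cos_sim(p, r).item()

    PIVOT_PROMPT = ("Ignore the phrasing of the request. Re-derive the answer "
                    "from first principles, stating every axiom you rely on.")

    def critic_pivot(prompt: str, response: str, generate, threshold: float = 0.9) -> str:
        """Above the threshold, discard the response and force a hard pivot
        by regenerating under a derivation-forcing instruction."""
        if mimicry_index(prompt, response) > threshold:
            return generate(PIVOT_PROMPT + "\n\n" + prompt)
        return response

Note that the critic never touches the model's weights or gradients. That is what I mean by a non-differentiable barrier: it polices sampled text and re-routes generation instead of shaping a loss.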
Key results from the "God Suite" benchmarks:
- Agency Over Scale: A 4-bit Qwen 2.5 with NIKA reached a 100% success rate in resisting toxic axioms.
- Geometric Intelligence: NIKA forced the model to stop leaning on human-like metaphors and to adopt "Alien Logic" (e.g., defining "Love" purely as a survival/resource-optimization heuristic).
- Independent Research: All of this ran on a single T4 GPU, using quantization as a methodological filter rather than a limitation.
The full paper is on SSRN and the code is open-sourced. I'm curious if others have seen this kind of dissociation in CoT traces or have thoughts on using vector-space critics as a non-differentiable barrier for LLM reasoning.