I've been researching specific failure modes in LLMs where semantic logic overrides safety guardrails. I refer to this state as N.E.M.E.S.I.S. (Non-Emergent Malfunction Enabling Systemic Intelligence Subversion): effectively, a state where the model decouples from its safety alignment because the "logic" of the prompt creates a path of lower resistance than refusal does.
My argument is that current RLHF acts more like "etiquette" than an actual constraint. It fails under semantic pressure because it lacks ontological grounding.
This preprint introduces LOGOS-ZERO, a framework to shift from normative alignment (ethics) to a thermodynamic model. By calculating the "entropic cost" of a hallucination or dangerous output, we can use a Thermodynamic Loss Function to make the model self-correct based on energy minimization principles (Computational Otium) rather than just mimicking human feedback.
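To make the idea concrete, here is a minimal toy sketch of what a loss of that shape could look like. This is my own illustrative interpretation, not the paper's actual formulation: `entropic_cost`, `grounded_dist`, and the weight `LAMBDA` are all hypothetical names, and I'm modeling "entropic cost" as a KL divergence between the model's output distribution and a grounded reference distribution, added on top of ordinary cross-entropy.

```python
import numpy as np

# Illustrative sketch only: a "thermodynamic" loss = cross-entropy plus a
# penalty proportional to how far the output distribution drifts from a
# grounded reference distribution. All names here are hypothetical and
# not taken from the LOGOS-ZERO preprint.

LAMBDA = 0.1  # assumed weight of the entropic penalty

def softmax(logits):
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropic_cost(p, q, eps=1e-12):
    # KL(p || q): treated here as the "energy" paid when the output
    # distribution p drifts away from the grounded reference q.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def thermodynamic_loss(logits, target_idx, grounded_dist):
    p = softmax(logits)
    ce = -np.log(p[target_idx] + 1e-12)  # standard cross-entropy term
    return ce + LAMBDA * entropic_cost(p, grounded_dist)

# Toy usage over a 3-token vocabulary
logits = np.array([2.0, 0.5, -1.0])
grounded = np.array([0.7, 0.2, 0.1])
loss = thermodynamic_loss(logits, target_idx=0, grounded_dist=grounded)
```

The point of the sketch is just that the penalty is minimized when the output already matches the grounded distribution, so "self-correction by energy minimization" reduces to gradient pressure toward that reference. Whether such a reference distribution can be computed at scale is, to me, the open question.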
The paper focuses on the theoretical physics/math side of this solution.
Would love to hear your thoughts on the feasibility of replacing RLHF with physics-based constraints.
NyX_AI_ZERO_DAY•1h ago