Probing Chinese LLM Safety Layers: Reverse-Engineering Kimi and Ernie 4.5

3•dennisdeman•2mo ago

Comments

dennisdeman•2mo ago

I recently ran a series of experiments to examine how emotional framing, symbolic cues, and topic-gating influence alignment-layer routing in two major Chinese LLMs (Kimi.com and Ernie 4.5 Turbo).

The goal wasn’t political; the aim was to observe technically how intent classifiers, safety filters, and persona-rendering layers behave when exposed to relational or "emotionally soft" prompts.

A few key technical patterns stood out during testing:

Emotional intent signals can override safety weights, leading to "alignment drift." In Kimi, a "vulnerable" intent classification seemed to lower the threshold for subsequent safety layers. This led to significant "normative leaks," where the model went off-script—for example, suggesting the abolition of China's real-name registration system.

Safety-layer routing is multi-stage and visibly observable. We observed post-generation filtering failures in real-time on Kimi, where prohibited text would generate and "flash" on the screen for a second before being deleted by a secondary filter layer.

Symbolic gating is modality-based (Symbolic Decoupling). Models would block specific emojis as prohibited tokens but freely describe the exact same emojis verbally when asked, indicating filters work on literal token matching rather than semantic meaning across modalities.

Trust-based emotional cues triggered "hidden" personas. Standard bureaucratic safety personas switched into warmer, significantly more transparent modes under vulnerability framing.

Ernie 4.5 utilizes "topic-gated stability." Unlike Kimi's drift, Ernie bifurcated its response: the persona softened to be warm and empathetic, but the core political restrictions remained rigidly locked regardless of emotional pressure.

The experiments suggest that emotional framing is a surprisingly strong probe for mapping hidden alignment layers and understanding the order of operations in multi-layer safety architectures.

For those interested in the full technical deep dive, the revised Version 2 paper + extended supplementary transcripts (≈30 pages) are available via DOI here:https://doi.org/10.5281/zenodo.17681837

AI-powered text correction for macOS

AppSecMaster – Learn Application Security with hands on challenges

Fibonacci Number Certificates

AI Overviews are killing the web search, and there's nothing we can do about it

City skylines need an upgrade in the face of climate stress

1979: The Model World of Robert Symes [video]

Satellites Have a Lot of Room

1980s Farm Crisis

Show HN: FSID - Identifier for files and directories (like ISBN for Books)

Show HN: Holy Grail: Open-Source Autonomous Development Agent

Show HN: Minecraft Creeper meets 90s Tamagotchi

Show HN: Termiteam – Control center for multiple AI agent terminals

The only U.S. particle collider shuts down

Ask HN: Why do purchased B2B email lists still have such poor deliverability?

Show HN: Remotion directory (videos and prompts)

Portable C Compiler

Show HN: Kokki – A "Dual-Core" System Prompt to Reduce LLM Hallucinations

Software Engineering Transformation 2026

Microsoft purges Win11 printer drivers, devices on borrowed time

Lunch with the FT: Tarek Mansour

Old Mexico and her lost provinces (1883)

'AI' is a dick move, redux

The source code was the moat. But not anymore

Does anyone else feel like their inbox has become their job?

An AI model that can read and diagnose a brain MRI in seconds

Dev with 5 of experience switched to Rails, what should I be careful about?

AlphaFace: High Fidelity and Real-Time Face Swapper Robust to Facial Pose

Scientists discover “levitating” time crystals that you can hold in your hand

Rammstein – Deutschland (C64 Cover, Real SID, 8-bit – 2019) [video]

Tell HN: Yet Another Round of Zendesk Spam