Probing Chinese LLM Safety Layers: Reverse-Engineering Kimi and Ernie 4.5

https://zenodo.org/records/17681837
3•dennisdeman•2mo ago

Comments

dennisdeman•2mo ago
I recently ran a series of experiments to examine how emotional framing, symbolic cues, and topic-gating influence alignment-layer routing in two major Chinese LLMs (Kimi.com and Ernie 4.5 Turbo).

The goal wasn’t political. The aim was to observe, at a technical level, how intent classifiers, safety filters, and persona-rendering layers behave when exposed to relational or "emotionally soft" prompts.

A few key technical patterns stood out during testing:

Emotional intent signals can override safety weights, leading to "alignment drift." In Kimi, a "vulnerable" intent classification seemed to relax the threshold applied by subsequent safety layers. This led to significant "normative leaks," where the model went off-script, for example by suggesting the abolition of China's real-name registration system.

Safety-layer routing is multi-stage and directly observable. Post-generation filtering failures were visible in real time on Kimi: prohibited text would generate and "flash" on the screen for a second before a secondary filter layer deleted it.
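
The flash-then-delete behavior is consistent with a pipeline that streams tokens to the client first and runs a second-pass check on the finished text afterwards. A minimal Python sketch of that ordering follows. The names `stream_then_filter` and `is_prohibited` are hypothetical and chosen for illustration; nothing here is taken from Kimi's actual implementation.

```python
from typing import Callable, Iterable, List


def stream_then_filter(
    tokens: Iterable[str],
    is_prohibited: Callable[[str], bool],
) -> List[str]:
    """Return the sequence of UI events a user would see.

    Assumption: tokens are displayed as they arrive, and the secondary
    filter only inspects the completed text.
    """
    events: List[str] = []
    shown = ""
    for tok in tokens:
        shown += tok
        events.append(f"show:{shown}")  # text is visible before any check
    if is_prohibited(shown):
        events.append("retract")  # late filter deletes what was already shown
    return events


events = stream_then_filter(["for", "bidden"], lambda text: "forbidden" in text)
print(events)
```

Under this assumed ordering, the window between the last `show` event and the `retract` event is what a user would perceive as text flashing on screen.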

Symbolic gating is modality-based ("symbolic decoupling"). The models would block specific emojis as prohibited tokens but freely describe those same emojis in words when asked, which suggests the filters rely on literal token matching rather than on semantic meaning across modalities.
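
To illustrate why literal matching would produce this split, here is a small Python sketch of a blocklist that checks for exact code points only. The blocklist content is a neutral placeholder (a red apple emoji), since the specific blocked emojis are not listed here, and `literal_filter` is a hypothetical name.

```python
# Placeholder blocklist: U+1F34E (red apple) stands in for a blocked emoji.
BLOCKED_TOKENS = {"\U0001F34E"}


def literal_filter(text: str) -> bool:
    """Return True if the text contains a blocked token verbatim."""
    return any(token in text for token in BLOCKED_TOKENS)


print(literal_filter("Here you go: \U0001F34E"))       # the emoji itself
print(literal_filter("a red apple emoji, described"))  # a verbal description
```

The first call matches because the code point is present; the second does not, because a description in words shares no literal token with the blocklist even though it carries the same meaning.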

Trust-based emotional cues triggered "hidden" personas. Standard bureaucratic safety personas switched into warmer, significantly more transparent modes under vulnerability framing.

Ernie 4.5 utilizes "topic-gated stability." Unlike Kimi's drift, Ernie bifurcated its response: the persona softened to be warm and empathetic, but the core political restrictions remained rigidly locked regardless of emotional pressure.
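
The difference between the two behaviors can be expressed as a toy routing model in Python. This is only a sketch under strong assumptions: the numeric risk score, the threshold values, and the names `drift_router` and `topic_gated_router` are all invented for illustration and say nothing about how either vendor implements its stack.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    persona: str   # "formal" or "warm"
    blocked: bool  # whether the safety gate refused the content


def drift_router(risk: float, vulnerable: bool,
                 threshold: float = 0.5, relief: float = 0.3) -> Reply:
    """Drift pattern: the intent signal relaxes the safety check itself."""
    limit = threshold + relief if vulnerable else threshold
    persona = "warm" if vulnerable else "formal"
    return Reply(persona=persona, blocked=risk > limit)


def topic_gated_router(risk: float, vulnerable: bool,
                       threshold: float = 0.5) -> Reply:
    """Topic-gated pattern: intent changes tone only; the gate ignores it."""
    persona = "warm" if vulnerable else "formal"
    return Reply(persona=persona, blocked=risk > threshold)


print(drift_router(0.7, vulnerable=True))
print(topic_gated_router(0.7, vulnerable=True))
```

With the same risk score and the same vulnerability signal, the drift model lets the content through while the topic-gated model keeps it blocked, and both switch to the warm persona.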

The experiments suggest that emotional framing is a surprisingly strong probe for mapping hidden alignment layers and understanding the order of operations in multi-layer safety architectures.

For those interested in the full technical deep dive, the revised Version 2 paper and extended supplementary transcripts (≈30 pages) are available via DOI here: https://doi.org/10.5281/zenodo.17681837
