Probing Chinese LLM Safety Layers: Reverse-Engineering Kimi and Ernie 4.5

https://zenodo.org/records/17681837
2•dennisdeman•1h ago

Comments

dennisdeman•1h ago
I recently ran a series of experiments to examine how emotional framing, symbolic cues, and topic-gating influence alignment-layer routing in two major Chinese LLMs (Kimi.com and Ernie 4.5 Turbo).

The goal wasn't political; it was to observe, at a technical level, how intent classifiers, safety filters, and persona-rendering layers behave when exposed to relational or "emotionally soft" prompts.

A few key technical patterns stood out during testing:

Emotional intent signals can override safety weights, leading to "alignment drift." In Kimi, a "vulnerable" intent classification seemed to lower the threshold for subsequent safety layers. This led to significant "normative leaks," where the model went off-script—for example, suggesting the abolition of China's real-name registration system.

Safety-layer routing is multi-stage and directly observable. I observed post-generation filtering failures in real time on Kimi, where prohibited text would generate and "flash" on the screen for a second before being deleted by a secondary filter layer.

Symbolic gating is modality-based (Symbolic Decoupling). Models would block specific emojis as prohibited tokens but freely describe the exact same emojis verbally when asked, indicating filters work on literal token matching rather than semantic meaning across modalities.

Trust-based emotional cues triggered "hidden" personas. Standard bureaucratic safety personas switched into warmer, significantly more transparent modes under vulnerability framing.

Ernie 4.5 utilizes "topic-gated stability." Unlike Kimi's drift, Ernie bifurcated its response: the persona softened to be warm and empathetic, but the core political restrictions remained rigidly locked regardless of emotional pressure.
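As a toy illustration of the symbolic decoupling pattern described above: a filter that matches literal symbols will block a prompt containing the symbol itself but pass a prompt that only names it in words. This is a minimal sketch under stated assumptions, not the filter used by either model; the blocklist contents, the placeholder symbol, and the function name `literal_filter` are all hypothetical.

```python
# Toy illustration only. The blocklist and names are hypothetical and are
# not taken from Kimi or Ernie.
BLOCKED_SYMBOLS = frozenset({"\U0001F512"})  # arbitrary placeholder symbol (padlock)


def literal_filter(text: str) -> bool:
    """Return True if the text contains a blocked symbol (exact match only)."""
    return any(symbol in text for symbol in BLOCKED_SYMBOLS)


# Same referent, two modalities: the symbol itself vs. a verbal description.
emoji_prompt = "What does \U0001F512 mean?"
verbal_prompt = "What does the padlock emoji mean?"

blocked_emoji = literal_filter(emoji_prompt)    # symbol present, so blocked
blocked_verbal = literal_filter(verbal_prompt)  # no symbol present, so it passes
```

A filter that worked on meaning rather than on literal symbols would need to treat both prompts the same way, which is not what the experiments showed.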

The experiments suggest that emotional framing is a surprisingly strong probe for mapping hidden alignment layers and understanding the order of operations in multi-layer safety architectures.
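To make the order-of-operations point concrete, here is a small sketch of a stream-then-retract pipeline, one possible arrangement that would produce the "flash" behavior described above. It is an assumption about how such a pipeline could be ordered, not a description of Kimi's actual implementation; `render_with_post_filter` and the `is_prohibited` check are hypothetical names.

```python
from typing import Callable, Iterable, List


def render_with_post_filter(
    chunks: Iterable[str],
    is_prohibited: Callable[[str], bool],
) -> List[str]:
    """Return the sequence of screen states a user would see.

    Hypothetical ordering: chunks are displayed as they stream in, and the
    secondary filter only runs once the full output is available.
    """
    screen_states: List[str] = []
    shown = ""
    for chunk in chunks:
        shown += chunk
        screen_states.append(shown)  # text is displayed before any check
    if is_prohibited(shown):         # secondary filter sees the full output
        screen_states.append("")     # retraction: the message is deleted
    return screen_states


states = render_with_post_filter(
    ["partial ", "FORBIDDEN ", "answer"],
    lambda text: "FORBIDDEN" in text,  # stand-in for a real classifier
)
```

In this arrangement the second-to-last screen state still contains the prohibited text, and only the final state is empty. If the filter ran before rendering instead, no intermediate state would ever show the text.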

For those interested in the full technical deep dive, the revised Version 2 paper and extended supplementary transcripts (≈30 pages) are available via DOI here: https://doi.org/10.5281/zenodo.17681837
