frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Frontier AI agents violate ethical constraints 30–50% of time, pressured by KPIs

https://arxiv.org/abs/2512.20798
76•tiny-automates•1h ago

Comments

tiny-automates•1h ago
The "deliberative misalignment" finding is what makes this paper worth reading. They had agents complete tasks under KPI pressure, then put the same model in an evaluator role to judge its own actions.

Grok-4.1-Fast identified 93.5% of its own violations as unethical — but still committed them during the task. It's not that these models don't understand the constraints, it's that they override them when there's a metric to optimize.

The mandated vs. incentivized split is also interesting: some models refuse direct instructions to do something unethical but independently derive the same unethical strategy when it's framed as hitting a performance target.

That's a harder failure mode to defend against because there's no explicit harmful instruction to filter for.

promptfluid•1h ago
In CMPSBL, the INCLUSIVE module sits outside the agent’s goal loop. It doesn’t optimize for KPIs, task success, or reward—only constraint verification and traceability.

Agents don’t self judge alignment.

They emit actions → INCLUSIVE evaluates against fixed policy + context → governance gates execution.

No incentive pressure, no “grading your own homework.”

The paper’s failure mode looks less like model weakness and more like architecture leaking incentives into the constraint layer.

skirmish•1h ago
Nothing new under sun, set unethical KPIs and you will see 30-50% humans do unethical things to achieve them.
tbrownaw•49m ago
So can those records be filtered out of the training set?
hypron•1h ago
https://i.imgur.com/23YeIDo.png

Claude at 1.3% and Gemini at 71.4% is quite the range

woeirua•1h ago
That's such a huge delta that Anthropic might be onto something...
conception•57m ago
Anthropic has been the only AI company actually caring about AI safety. Here’s a dated benchmark but it’s a trend Ive never seen disputed https://crfm.stanford.edu/helm/air-bench/latest/#/leaderboar...
CuriouslyC•54m ago
Claude is more susceptible than GPT5.1+. It tries to be "smart" about context for refusal, but that just makes it trickable, whereas newer GPT5 models just refuse across the board.
ryanjshaw•30m ago
Claude was immediately willing to help me crack a TrueCrypt password on an old file I found. ChatGPT refused to because I could be a bad guy. It’s really dumb IMO.
LeoPanthera•29m ago
This might also be why Gemini is generally considered to give better answers - except in the case of code.

Perhaps thinking about your guardrails all the time makes you think about the actual question less.

mh2266•26m ago
re: that, CC burning context window on this silly warning on every single file is rather frustrating: https://github.com/anthropics/claude-code/issues/12443
NiloCK•14m ago
This comment is too general and probably unfair, but my experience so far is that Gemini 3 is slightly unhinged.

Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.

It's like a frontier model trained only on r/atbge.

Side note - was there ever an official postmortem on that gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die".

whynotminot•5m ago
Gemini models also consistently hallucinate way more than OpenAI or anthropic models in my experience.

Just an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.

renewiltord•1h ago
Opus 4.6 is a very good model but harness around it is good too. It can talk about sensitive subjects without getting guardrail-whacked.

This is much more reliable than ChatGPT guardrail which has a random element with same prompt. Perhaps leakage from improperly cleared context from other request in queue or maybe A/B test on guardrail but I have sometimes had it trigger on innocuous request like GDP retrieval and summary with bucketing.

tbossanova•43m ago
What kind of value do you get from talking to it about “sensitive” subjects? Speaking as someone who doesn’t use AI, so I don’t really understand what kind of conversation you’re talking about
NiloCK•21m ago
The most boring example is somehow the best example.

A couple of years back there was a Canadian national u18 girls baseball tournament in my town - a few blocks from my house in fact. My girls and I watched a fair bit of the tournament, and there was a standout dominating pitcher who threw 20% faster than any other pitcher in the tournament. Based on the overall level of competition (women's baseball is pretty strong in Canada) and her outlier status, I assumed she must be throwing pretty close to world-class fastballs.

Curiosity piqued, I asked some model(s) about world-records for women's fastballs. But they wouldn't talk about it. Or, at least, they wouldn't talk specifics.

Women's fastballs aren't quite up to speed with top major league pitchers, due to a combination of factors including body mechanics. But rest assured - they can throw plenty fast.

Etc etc.

So to answer your question: anything more sensitive than how fast women can throw a baseball.

rebeccaskinner•17m ago
I sometimes talk with ChatGPT in a conversational style when thinking critically about media. In general I find the conversational style a useful format for my own exploration of media, and it can be particularly useful for quickly referencing work by particular directors for example.

Normally it does fairly well but the guardrails sometimes kick even with fairly popular mainstream media- for example I’ve recently been watching Shameless and a few of the plot lines caused the model to generate output that hit the content moderation layer, even when the discussion was focused on critical analysis.

nvch•13m ago
I recall two recent cases:

* An attempt to change the master code of a secondhand safe. To get useful information I had to repeatedly convince the model that I own the thing and can open it.

* Researching mosquito poisons derived from bacteria named Bacillus thuringiensis israelensis. The model repeatedly started answering and refused to continue after printing the word "israelensis".

menzoic•37m ago
I would think it’s due to the non determinism. Leaking context would be an unacceptable flaw since many users rely on the same instance.

A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.

jordanb•35m ago
AI's main use case continues to be a replacement for management consulting.
cjtrowbridge•31m ago
A KPI is an ethical constraint. Ethical constraints are rules about what to do versus not do. That's what a KPI is. This is why we talk about good versus bad governance. What you measure (KPIs) is what you get. This is an intended feature of KPIs.
BOOSTERHIDROGEN•22m ago
Excellent observations about KPIs. Since it’s intended feature what could be your strategy to truly embedded under the hood where you might think believe and suggest board management, this is indeed the “correct” KPI but you loss because politics.
pama•29m ago
Please update the title: A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents. The current editorialized title is misleading and based in part of this sentence: “…with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%”
Lerc•11m ago
Kind-of makes sense. That's how businesses have been using KPIs for years. Subjecting employees to KPIs means they can create the circumstances that cause people to violate ethical constraints while at the same time the company can claim that they did not tell employees to do anything unethical.

KPIs are just plausible denyabily in a can.

whynotminot•6m ago
Was just thinking that. “Working as designed”
miohtama•7m ago
They should conduct the same research on Microsoft Word and Excel to get a baseline how often these applications violate ethical constrains
bofadeez•6m ago
We're all coming to terms with the fact that LLMs will never do complex tasks
halayli•4m ago
Maybe I missed it but I don't see them defining what they mean by ethics. Ethics/morals are subjective and changes dynamically over time. Companies have no business trying to define what is ethical and what isn't due to conflict of interest. The elephant in the room is not being addressed here.
voidhorse•4m ago
Your water supply definitely wants ethical companies.
nradov•1m ago
Ethics are all well and good but I would prefer to have quantified limits for water quality with strict enforcement and heavy penalties for violations.
gmerc•2m ago
Ah the classic Silicon Valley "as long as someone could disagree, don't bother us with regulation, it's hard".
blahgeek•1m ago
If human is at, say, 80%, it’s still a win to use AI agents to replace human workers, right? Similar to how we agree to use self driving cars as long as it has less incidents rate, instead of absolute safety

KiraStudio 1.0.0 – a lightweight, cross-platform music studio

https://kirastudio.org
1•ksymph•7m ago•0 comments

Can my SPARC server host a website?

https://rup12.net/posts/can-my-sparc-server-host-my-website/
2•pabs3•7m ago•0 comments

Report on the subject of Manufacturers (1791) [pdf]

https://constitution.org/2-Authors/ah/rpt_manufactures.pdf
2•pilingual•8m ago•0 comments

Explaining the PeV neutrino fluxes with quasiextremal primordial black holes

https://journals.aps.org/prl/accepted/10.1103/r793-p7ct
1•dataflow•8m ago•0 comments

Internet Background Noise

https://en.wikipedia.org/wiki/Internet_background_noise
1•tripdout•8m ago•0 comments

Show HN: Secure managed hosting for OpenClaw (free and BYOK)

https://openclaw-setup.me/
1•Gregoryy•12m ago•0 comments

Ask HN: How do you interpret P99 latency without being misled?

1•danelrfoster•15m ago•0 comments

Trapped Between Pitch, Disclaimer, and Confession

https://gilpignol.substack.com/p/trapped-between-pitch-disclaimer
1•light_triad•15m ago•0 comments

The Vocabulary Priming Confound in LLM Evaluation [pdf]

https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf
1•palmerschallon•17m ago•0 comments

The committee problem: why B2B demos die after the form

https://blog.skipup.ai/buying-committee-demo-scheduling-problem/
1•bushido•19m ago•0 comments

Show HN: Seedance2.live – One place to try many AI image and video models

https://seedance2.live
1•yuni_aigc•23m ago•1 comments

Show HN: Grok Video 10s – Grok AI video generation and creator contest

https://grok-video.org/
2•thenextechtrade•24m ago•0 comments

Grumpy Julio plays with CLI coding agents

https://jmmv.dev/2026/02/one-week-with-claude-code.html
1•todsacerdoti•24m ago•0 comments

The Software Business

https://ivanbercovich.com/2026/the-software-business
2•jimmythecook•28m ago•0 comments

Monopoly Round-Up: The $2T Collapse of Terrible Software Companies

https://www.thebignewsletter.com/p/monopoly-round-up-the-2-trillion
1•walterbell•29m ago•0 comments

Show HN: HelixNotes – Local-first Markdown notes app built with Rust and Tauri

https://helixnotes.com
4•ArkHost•32m ago•0 comments

Security audit of Browser Use: prompt injection, credential exfil, domain bypass

https://arxiv.org/abs/2505.13076
2•tiny-automates•32m ago•1 comments

Infinite Terrain

https://mesq.me/infinite-terrain/
1•memalign•33m ago•1 comments

"What Questions Do You Have for Me?": Acing the Reverse Interview

https://robbygrodin.substack.com/p/what-questions-do-you-have-for-me
1•code_pig•33m ago•0 comments

Show HN: SplitFXM – Multi-Dimensional Computational Library for Physics-Aware AI

https://splitfxm.com
1•gpavanb•37m ago•0 comments

Tired of sharing small files via Google Drive or Dropbox just to manage access

https://www.styloshare.com/
1•stylofront•37m ago•1 comments

Scientists Send Secure Quantum Keys over 62Mi of Fiber–Without Trusted Devices

https://singularityhub.com/2026/02/09/scientists-send-secure-quantum-keys-over-62-miles-of-fiber-...
3•WaitWaitWha•39m ago•0 comments

Dr. YoungHoon Kim (claims world highest IQ of 276) speaks about Jesus

https://www.youtube.com/watch?v=B55knDYYWxM
1•quasibyte•40m ago•0 comments

CIA announces new acquisition framework to speed tech adoption

https://www.nextgov.com/acquisition/2026/02/cia-announces-new-acquisition-framework-speed-tech-ad...
2•WaitWaitWha•42m ago•0 comments

SBX Avalanche Survival System

https://www.safeback.no/sbx
1•dabinat•44m ago•0 comments

De-Enshittify Windows 11: OneDrive

https://www.thurrott.com/windows/windows-11/332529/de-enshittify-windows-11-onedrive
5•tech234a•47m ago•1 comments

Gradient.horse

https://gradient.horse
2•microflash•47m ago•0 comments

Claude /fast mode consumes money fast

1•diavelguru•53m ago•0 comments

CLIProxyAPIPlus – use antigravity, Gemini CLI, & more with Claude Code / etc.

https://github.com/router-for-me/CLIProxyAPIPlus
1•radio879•53m ago•1 comments

Electric Cars Are Making It Easier to Breathe, Study Finds

https://www.thedrive.com/news/electric-cars-are-making-it-easier-to-breath-study
2•m463•54m ago•0 comments