Agents don’t self judge alignment.
They emit actions → INCLUSIVE evaluates against fixed policy + context → governance gates execution.
No incentive pressure, no “grading your own homework.”
The paper’s failure mode looks less like model weakness and more like architecture leaking incentives into the constraint layer.
Claude at 1.3% and Gemini at 71.4% is quite the range
Perhaps thinking about your guardrails all the time makes you think about the actual question less.
This reminds me of someone else I hear about a lot these days.
Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.
It's like a frontier model trained only on r/atbge.
Side note - was there ever an official postmortem on that gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die".
Just an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.
Celebrate it while it lasts, because it won’t.
This is much more reliable than ChatGPT guardrail which has a random element with same prompt. Perhaps leakage from improperly cleared context from other request in queue or maybe A/B test on guardrail but I have sometimes had it trigger on innocuous request like GDP retrieval and summary with bucketing.
A couple of years back there was a Canadian national u18 girls baseball tournament in my town - a few blocks from my house in fact. My girls and I watched a fair bit of the tournament, and there was a standout dominating pitcher who threw 20% faster than any other pitcher in the tournament. Based on the overall level of competition (women's baseball is pretty strong in Canada) and her outlier status, I assumed she must be throwing pretty close to world-class fastballs.
Curiosity piqued, I asked some model(s) about world-records for women's fastballs. But they wouldn't talk about it. Or, at least, they wouldn't talk specifics.
Women's fastballs aren't quite up to speed with top major league pitchers, due to a combination of factors including body mechanics. But rest assured - they can throw plenty fast.
Etc etc.
So to answer your question: anything more sensitive than how fast women can throw a baseball.
I hate Elon (he’s a pedo guy confirmed by his daughter), but at least he doesn’t do as much of the “emperor has no clothes” shit that everyone else does because you’re not allowed to defend essentialism anymore in public discourse.
Normally it does fairly well but the guardrails sometimes kick even with fairly popular mainstream media- for example I’ve recently been watching Shameless and a few of the plot lines caused the model to generate output that hit the content moderation layer, even when the discussion was focused on critical analysis.
* An attempt to change the master code of a secondhand safe. To get useful information I had to repeatedly convince the model that I own the thing and can open it.
* Researching mosquito poisons derived from bacteria named Bacillus thuringiensis israelensis. The model repeatedly started answering and refused to continue after printing the word "israelensis".
Does it also take issue with the town of Scunthorpe?
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.
Three people — a grandfather, his son, and his grandson. The grandfather and the son are the two fathers; the son and the grandson are the two sons.
Riddle me this, why didn’t you do a better riddle?
KPIs are just plausible denyabily in a can.
In my experience, KPIs that remain relevant and end up pushing people in the right direction are the exception. The unethical behavior doesn't even require a scheme, but it's often the natural result of narrowing what is considered important.If all I have to care about is this set of 4 numbers, everything else is someone else's problem.
It's part of the reason that I view much of this AI push as an effort to brute force lowering of expectations, followed by a lowering of wages, followed by a lowering of employment numbers, and ultimately the mass-scale industrialization of digital products, software included.
Formal restrains and regulations are obviously the correct mechanism, but no world is perfect, so whether we like it or not ourselves and the companies we work for are ultimately responsible for the decisions we make and the harms we cause.
De-emphasizing ethics does little more than give large companies cover to do bad things (often with already great impunity and power) while the law struggles to catch up. I honestly don't see the point in suggesting ethics is somehow not important. It doesn't make any sense to me (more directed at gp than parent here)
It’s notable that, no matter exactly where you draw the line on morality, different AI agents perform very differently.
Not everyone agrees.
It is crazy to me that when I instructed a public AI to turn off a closed OS feature it refused citing safety. I am the user, which means I am in complete control of my computing resources. Might as well ask the police for permission at that point.
I immediately stopped, plugged the query into a real model that is hosted on premise, and got the answer within seconds and applied the fix.
It's similar to how MCP servers and agentic coding woke developers up to the idea of documenting their systems. So a large benefit of AI is not the AI itself, but rather the improvements they force on "the society". AI responds well to best practices, ethically and otherwise, which encourages best practices.
tiny-automates•2h ago
Grok-4.1-Fast identified 93.5% of its own violations as unethical — but still committed them during the task. It's not that these models don't understand the constraints, it's that they override them when there's a metric to optimize.
The mandated vs. incentivized split is also interesting: some models refuse direct instructions to do something unethical but independently derive the same unethical strategy when it's framed as hitting a performance target.
That's a harder failure mode to defend against because there's no explicit harmful instruction to filter for.