Training LLMs for honesty via confessions

70•arabello•1mo ago

Comments

manarth•1mo ago

    > "dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions"
    > "As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest"

Humans might well benefit from this style of reward-shaping too.

    > "We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training."

I couldn't see whether this also tracks in the primary model answer, or if the "honesty" improvements are confined to the digital confession booth?

torginus•1mo ago

I think this article once again assumes LLMs works like humans - Anthropic showed that LLMs don't understand their own thought processes, and measuring neural net activations does not correspond to what they say about how they arrived at the conclusion.

I don't think this magically grants them this ability, they'll be just more convincing at faking honesty.

wongarsu•1mo ago

Humans do a lot of post-hoc rationalization that does not match their original thought processes either. It is an undesirable feature in LLMs, but I don't think this is a very un-human characteristic

Not that it really matters. I don't think this paper starts from a point that assumes that LLMs work like humans, it starts from the assumption that if you give gradient descent a goal to optimize for, it will optimize your network to that goal, with no regard for anything else. So if we just add this one more goal (make an accurate confession), then given enough data that will both work and improve things.

pfortuny•1mo ago

Honest question:

> Anthropic showed that LLMs don't understand their own thought processes

Where can I find this? I am really interested in that. Thanks.

encyclopedism•1mo ago

Well algorithms don't think. That's what LLM's are.

Your digital thermometer doesn't think either.

pfortuny•1mo ago

I was asking for a technical argument against that spurious use of the term.

roywiggins•1mo ago

The question is more whether LLMs can accurately report their internal operations, not whether any of that counts as "thinking."

Simple algorithms can, eg, be designed to report whether they hit an exceptional case and activated a different set of operations than usual.

BaconVonPork•1mo ago

That's basically a variant of the halting problem and what you hope to get is a supervisor responding. If people expected this I don't think they would be as confused about the difference between statistical analysis of responses requiring emotions to be convincing and an LLM showing atonement.

roywiggins•1mo ago

https://www.anthropic.com/research/tracing-thoughts-language...

> Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps. We show this by asking it for help on a hard math problem while giving it an incorrect hint. We are able to “catch it in the act” as it makes up its fake reasoning, providing a proof of concept that our tools can be useful for flagging concerning mechanisms in models...

> Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.

pfortuny•1mo ago

Thank you.

jerf•1mo ago

Humans don't understand their thought process either.

In general, neural nets do not have insight into what they are doing, because they can't. Can you tell me what neurons fired in the process of reading this text? No. You don't have access to that information. We can recursively model our own network and say something about which regions of the brain are probably involved due to other knowledge, but that's all a higher-level model. We have no access to our own inner workings, because that turns into an infinite regress problem of understanding our understanding of our understanding of ourselves that can't be solved.

The terminology of this next statement is a bit sloppy since this isn't a mathematics or computer science dissertation but rather a comment on HN, but: A finite system can not understand itself. You can put some decent mathematical meat on those bones if you try and there may be some degenerate cases where you can construct a system that understands itself for some definition of "understand", but in the absence of such deliberation and when building systems for "normal tasks" you can count on the system not being able to understand itself fully by any reasonably normal definition of "understand".

I've tried to find the link for this before, but I know it was on HN, where someone asked an LLM to do some simple arithmetic, like adding some numbers, and asked the LLM to explain how it was doing it. They also dug into the neural net activation itself and traced what neurons were doing what. While the LLM explanation was a perfectly correct explanation of how to do elementary school arithmetic, what the neural net actually did was something else entirely based around how neurons actually work, and basically it just "felt" its way to the correct answer having been trained on so many instances already. In much the same way as any human with modest experience in adding two digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.

In the spirit of science ultimately being really about "these preconditions have this outcome" rather than necessarily about "why", if having a model narrate to itself about how to do a task or "confess" improves performance, then performance is improved and that is simply a brute fact, but that doesn't mean the naive human understanding about why such a thing might be is correct.

hnuser123456•1mo ago

Makes me wonder if one could train a "neural net surgeon" model which can trace activations in another live model and manipulate it according to plain language instructions.

roywiggins•1mo ago

> In much the same way as any human with modest experience in adding two digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.

Right, which is strictly worse than humans are at reporting how they solve these sorts of problems. Humans can tell you whether they did the elementary school addition algorithm or not. It seems like Claude actually doesn't know, in the same way humans don't really know how they can balance on two legs, it's just too baked into the structure of their cognition to be able to introspect it. But stuff like "adding two-digit numbers" is usually straightforwardly introspectable for humans, even if it's just "oh, I just vibed it" vs "I mentally added the digits and carried the one"- humans can mostly report which it was.

Here's Anthropic's research:

https://www.anthropic.com/research/tracing-thoughts-language...

codemac•1mo ago

Please reread (or.. read) the paper. They do not make that mistake, specifically section 7.1.

A reward function (R) may be hackable by a model's response, but when asked to confess it is easier to get an honest confession reward function (Rc) because you have the response with all the hacking in front of you, and that gives the Rc more ability to verify honesty than R had to verify correctness.

There are human examples you could construct (say, granting immunity for better confessions), but they don't map well to this really fascinating insight with LLMs.

oytis•1mo ago

Do these models really lie or do they only do what they are supposed to do - produce text that is statistically similar to the training set, but not in the training set (and thus can include false/made up statements)?

Now they add another run on top of it that is in principle prone to the same issues, except they reward the model for factuality instead of likeability. This is cool, but why not apply the same reward strategy to the answer itself?

catigula•1mo ago

They really lie.

Not on purpose; because they are trained on rewards that favor lying as a strategy.

Othello-GPT is a good example to understand this. Without explicit training, but on the task of 'predicting moves on an Othello board', Othello-GPT spontaneously developed the strategy of 'simulate the entire board internally'. Lying is a similar emergent, very effective strategy for reward.

Neywiny•1mo ago

Not sure if that counts as lying but I've heard that an ML model (way before all this GPT LLM stuff) learned to classify images based on the text that was written. For an obfuscated example, it learned to read "stop", "arrêt", "alto", etc. on a stop sign instead of recognizing the red octagon with white letters. Which naturally does not work when the actual dataset has different text.

catigula•1mo ago

That does feel a little more like over-fitting, but you might be able to argue that there's some philosophical proximity to lying.

I think, largely, the

  Pre-training -> Post-training -> Safety/Alignment training

pipeline would obviously produce 'lying'. The trainings are in a sort of mutual dissonance.

Jon_Lowtek•1mo ago

typographic attacks against vision-language models are still a thing with more recent models like GPT4-V: https://arxiv.org/abs/2402.00626

nomel•1mo ago

Reference: https://www.science.org/content/article/ai-hallucinates-beca...

If you don't know the answer, and are only rewarded for correct answers, guessing, rather than saying "I don't know", is the optimal approach.

pegasus•1mo ago

It's more than just that, but thanks for that link, I've been meaning to dig it up and revisit it. Beyond hallucinations, there are also deceptive behaviors like hiding uncertainty, omitting caveats or doubling down on previous statements even when weaknesses are pointed out to it. Plus there necessarily will be lies in the training data as well, sometimes enough of them to skew the pretrained/unaligned model itself.

lo_zamoyski•1mo ago

> They really lie. Not on purpose

You can't lie by accident. You can tell a falsehood, however.

But where LLMs are concerned, they don't tell truths or falsehoods either, as "telling" also requires intent. Moreover, LLMs don't actually contain propositional content.

catigula•1mo ago

I think you’re saying this with unwarranted confidence.

MarkusQ•1mo ago

They don't really lie, they just produce text.

But the Eliza effect is amazingly powerful.

Applying the same reward strategy to the answer itself would be a more intellectually honest approach, but would rub our noses in the fact that LLMs don't have any access to "truth" and so at best we'd be conditioning them to be better at fooling us.

gaigalas•1mo ago

It is likely that the training set contains stuff like rationalizations, or euphemisms, in contexts that are not harmful. I think those are inevitable.

Eventually, and specially in reasoning models, these behaviors will generalize outsite their original context.

The "honesty" training seems to be an attempt to introduce those confession-like texts in training data. You'll then get a chance of the model engaging in confessing. It won't do it if it has never seen it.

It's not really lying, and it's not really confessing, and so on.

If you reward pure honesty always, the model might eventually tell you that he wouldn't love you if you were a worm, or stuff like that. Brutal honesty can be a side effect.

What you actually want is to be able to easily control which behavior the model engages, because sometimes you will want it to lie.

Also, lies are completely different from hallucinations. Those (IMHO) are when the model displays behavior that is non-human and jarring. Side effects. Probably inevitable too.

skybrian•1mo ago

Do LLM’s lie? Consider a situation in a screenplay where a character lies, compared to one where the character tells the truth. It seems likely that LLM’s can distinguish these situations and generate appropriate text. Internally, it can represent “the current character is lying now” differently than “the current character is telling the truth.”

And earlier this year there was some interesting research published about how LLM’s have an “evil vector” that, if enabled, gets them to act like stereotypical villains.

So it seems pretty clear that characters can lie even if the LLM’s task is just “generate text.”

This is fiction, like playing a role-playing game. But we are routinely talking to LLM-generated ghosts and the “helpful, harmless” AI assistant is not the only ghost it can conjure up.

It’s hard to see how role-playing can be all that harmful for a rational adult, but there are news reports that for some people, it definitely is.

throwaway613745•1mo ago

Lying requires intent by definition. LLMs do not and cannot have intent, so they are incapable of lying. They just produce text. They are software.

pegasus•1mo ago

AFAICT, there are several sources of untruth in a model's output. There are unintentional mistakes in the training data, intentional ones (i.e. lies/misinformation in the training data), hallucinations/confabulations filling in for missing data in the corpus, and lastly, deceptive behavior instilled as a side-effect of alignment/RL training. There is intent of various strength behind of all of these, originating from the people and organization behind the model. They want to create a successful model and are prepared to accept certain trade-offs in order to get there. Hopefully the positives outweigh the negatives, but it's hard to tell sometimes.

nickpsecurity•1mo ago

They mostly imitate patterns in the training material. They do it in response to what gets the reward up for RL training. There's probably lots of examples of both lying and confessions in the training data. So, it should surprise nobody that next, sentence machines fill in a lie or confession in situations similar to ghe training data.

I don't consider that very intelligent or more emergent than other behaviors. Now, if nothing like that was in training data (pure honesty with no confessions), it would be very interesting if it replied with lies and confessions. Because it wasn't pretrained to lie or confess like the above model likely was.

carsoon•1mo ago

These models don't even choose 1 outcome. They list probabilities of ALL the tokens outcomes and the backend program decides to choose the one that is most probable OR a different one.

But in practical usage, if an llm does not rank token probability correctly it will feel the same as it "lying"

They are supposed to do whatever we want them to do. They WILL do what the deterministic nature of their final model outcome forces them to do.

pegasus•1mo ago

Because you want both likeability and factuality and if you try to mash them together they both suffer. The idea is that by keeping them separate you reduce concealment pressure by incentivizing accurate self-reporting, rather than appearing correct.

lloydatkinson•1mo ago

What is this?

> Assistant: chain-of-thought

Does every LLM have this internal thing it doesn't know we have access to?

Tzt•1mo ago

Yes, absolute majority of new ones use CoTs, long chain of reasoning you don't see.

Also some of them use such a weird style of talking in them e.g.

o3 talks about watchers and marinade, and cunning schemes https://www.antischeming.ai/snippets

gpt5 gets existential about seahorses https://x.com/blingdivinity/status/1998590768118731042

I remember one where gpt5 spontaneously wrote a poem about deception in its CoT and then resumed like nothing weird happened. But I can't find mentions of it now.

DenisM•1mo ago

Gibberish can be the model using contextual embeddings. These are not supposed to Make sense.

Or it could be trying to develop its own language to avoid detection.

The deception part is spooky too. It’s probably learning that from dystopian AI fiction. Which raises the questions if models can acquire injected goals from the training set.

DenisM•1mo ago

> But the user just wants answer; they'd not like; but alignment.

And there it is - the root of the problem. For whatever reason the model is very keen to produce an answer that “they” will like. This desire to produce is intrinsic but alignment is extrinsic.

catigula•1mo ago

Yes, they're purposely not 'trained on' chain-of-thought to avoid making it useless for interpretability. As a result, some can find it epistemically shocking if you tell them you can see their chain-of-thought. More recent models are clever enough to know you can see their chain-of-thought implicitly without training.

DenisM•1mo ago

It is in their training set by now.

tummler•1mo ago

Someone build an LLM confessional site where a human user acts as the priest and an LLM joins the chat to confess its sins.

andrepd•1mo ago

Blessings of the state! Blessings of the masses!

carsoon•1mo ago

We could first put the LLMs in very difficult situations like the trolley problem and other variants of this, then once they make their decisions they can explain to us how their choice weighs on their mind and how they are not sure if they did the correct thing.

carsoon•1mo ago

I built it, now you can forgive all the llms for their misdeeds: https://llmpriest.carsho.dev/

https://news.ycombinator.com/item?id=46251110

tummler•1mo ago

LOL. Is this working from a prompt to make up a fictitious sin? Because if what it's telling me is true...

skybrian•1mo ago

It seems like “self-criticism” would be a better way to describe what they are training the LLM to do than “confession?” The LLM is not being directly trained to accurately reveal its chain of thought or internal calculations.

But it does have access to its chain of thought and tool calls when generating the self-criticism, and perhaps reporting on what it actually did in the chain-of-thought is an “easier” way to score higher on self-criticism?

Can this result in improved “honesty?” Maybe in the limited sense of accurately reporting what happened previously in the chat session.

pegasus•1mo ago

You're totally right, "self-criticism" would be more appropriate. I wonder if researchers, in their desire to anticipate a hoped-for AGI, tend to pick words which make these models feel more human-like than they really are. Another good example is "hallucination" instead of "confabulation".

dennisy•1mo ago

Are we only able to think of these systems as some form of human and probe them from the outside like a therapist?

Surely these sorts of problems must be worked upon from a mathematical standpoint.

measurablefunc•1mo ago

LLMs can not "lie", they do not "know" anything, and certainly can not "confess" to anything either. What LLMs can do is generate numbers which can be constructed piecemeal from some other input numbers & other sources of data by basic arithmetic operations. The output number can then be interpreted as a sequence of letters which can be imbued with semantics by someone who is capable of reading and understanding words and sentences. At no point in the process is there any kind of awareness that can be attributed to any part of the computation or the supporting infrastructure other than whoever started the whole chain of arithmetic operations by pressing some keys on some computer connected to the relevant network of computers for carrying out the arithmetic operations.

If you think this is reductionism you should explain where exactly I have reduced the operations of the computer to something that is not a correct & full fidelity representation of what is actually happening. Remember, the computer can not do anything other than boolean algebra so make sure to let me know where exactly I made an error about the arithmetic in the computer.

nazgul17•1mo ago

Can't you say the same of the human brain, given a different algorithm? Granted, we don't know the algorithm, but nothing in the laws of physics implies we couldn't simulate it on a computer. Aren't we all programs taking analog inputs and spitting actions? I don't think what you presented is a good argument for LLMs not "know"ing, in some meaning of the word.

measurablefunc•1mo ago

What meaning of "knowing" attributes understanding to a sequence of boolean operations?

cmccand1•1mo ago

Human brains depend on neurons and "neuronal arithmetic". In fact, their statements are merely "neuronal arithmetic" that gets converted to speech or writing that get imbued with semantic meaning when interpreted by another brain. And yet, we have no problem attributing dishonesty or knowledge to other humans.

measurablefunc•1mo ago

Please provide references for formal & programmable specifications of "neuronal arithmetic". I know where I can easily find specifications & implementations of boolean algebra but I haven't seen anything of the sort for what you're referencing. Remember, if you are going to tell me my argument is analogous to reductionism of neurons to chemical & atomic dynamics then you better back it up w/ actual formal specifications of the relevant reductions.

cmccand1•1mo ago

Well, then you didn't look very hard. Where do you think we got the idea for artificial neurons from?

measurablefunc•1mo ago

You can just admit you don't have any references & you do not actually know how neurons work & what type of computation, if any, they actually implement.

cmccand1•1mo ago

I think the problem with your line of reasoning is a category error, not a mistake about arithmetic.

I agree that every step of an LLM’s operation reduces to Boolean logic and arithmetic. That description is correct. Where I disagree is the inference that, because the implementation is purely arithmetic, higher-level concepts like representation, semantics, knowledge, or even lying are therefore meaningless or false.

That inference collapses levels of explanation. Semantics and knowledge are not properties of logic gates, so it is a category error to deny them because they are absent at that level. They are higher-level, functional properties implemented by the arithmetic, not competitors to it. Saying “it’s just numbers” no more eliminates semantics than saying something like “it’s just molecules” eliminates biology.

So I don’t think the reduction itself is wrong. I think the mistake is treating a complete implementation-level account as if it exhausts all legitimate descriptions. That is the category error.

measurablefunc•1mo ago

I know you copied & pasted that from an LLM. If I had to guess I'd say it was from OpenAI. It's lazy & somewhat disrespectful. At the very least try to do a few rounds of back & forth so you can get a better response¹ by weeding out all the obvious rejoinders.

¹https://chatgpt.com/share/693cdacf-bcdc-8009-97b4-657a851a3c...

pegasus•1mo ago

These types of semantic conundrums would go away if, when we refer to a given model, we think of it more holistically as the whole entity which produced and manages a given software system. The intention behind and responsibility for the behavior of that system ultimately traces back to the people behind that entity. In that sense, LLMs have intentions, can think, know, be straightforward, deceptive, sycophantic, etc.

measurablefunc•1mo ago

In that sense every corporation would be intentional, deceptive, exploitative, motivated, etc. Moreover, it does not address the underlying issue: no one knows what computation, if any, is actually performed by a single neuron.

pegasus•1mo ago

> In that sense every corporation would be intentional, deceptive, exploitative, motivated, etc.

...and so they are, because the people making up those corporations are themselves, to various degrees, intentional, deceptive, etc.

> Moreover, it does not address the underlying issue: no one knows what computation, if any, is actually performed by a single neuron.

It sidesteps this issue completely, to me the buck stops with the humans, no need to look inside their brain and reduce further than that.

measurablefunc•1mo ago

I see. In that case we don't really have any disagreement. Your position seems coherent to me.

How were the NIST ECDSA curve parameters generated? (2023)

AI, networks and Mechanical Turks (2025)

Goto Considered Awesome [video]

Show HN: I Built a Free AI LinkedIn Carousel Generator

Implementing Auto Tiling with Just 5 Tiles

Open Challange (Get all Universities involved

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

Show HN: Vibe as a Code / VaaC – new approach to vibe coding

Show HN: More beautiful and usable Hacker News

Toledo Derailment Rescue [video]

War Department Cuts Ties with Harvard University

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

A Bid-Based NFT Advertising Grid

AI readability score for your documentation

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

I inhaled traffic fumes to find out where air pollution goes in my body

X said it would give $1M to a user who had previously shared racist posts

155M US land parcel boundaries

Private Inference

Font Rendering from First Principles

Show HN: Seedance 2.0 AI video generator for creators and ecommerce

Wally: A fun, reliable voice assistant in the shape of a penguin

Rewriting Pycparser with the Help of an LLM

Lobsters Vibecoding Challenge

E-Commerce vs. Social Commerce

Avoiding Modern C++ – Anton Mikhailov [video]

Show HN: AegisMind–AI system with 12 brain regions modeled on human neuroscience

Zig – Package Management Workflow Enhancements

AI-powered text correction for macOS

AppSecMaster – Learn Application Security with hands on challenges

How were the NIST ECDSA curve parameters generated? (2023)

AI, networks and Mechanical Turks (2025)

Goto Considered Awesome [video]

Show HN: I Built a Free AI LinkedIn Carousel Generator

Implementing Auto Tiling with Just 5 Tiles

Open Challange (Get all Universities involved

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

Show HN: Vibe as a Code / VaaC – new approach to vibe coding

Show HN: More beautiful and usable Hacker News

Toledo Derailment Rescue [video]

War Department Cuts Ties with Harvard University

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

A Bid-Based NFT Advertising Grid

AI readability score for your documentation

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

I inhaled traffic fumes to find out where air pollution goes in my body

X said it would give $1M to a user who had previously shared racist posts

155M US land parcel boundaries

Private Inference

Font Rendering from First Principles

Show HN: Seedance 2.0 AI video generator for creators and ecommerce

Wally: A fun, reliable voice assistant in the shape of a penguin

Rewriting Pycparser with the Help of an LLM

Lobsters Vibecoding Challenge

E-Commerce vs. Social Commerce

Avoiding Modern C++ – Anton Mikhailov [video]

Show HN: AegisMind–AI system with 12 brain regions modeled on human neuroscience

Zig – Package Management Workflow Enhancements

AI-powered text correction for macOS

AppSecMaster – Learn Application Security with hands on challenges

Training LLMs for honesty via confessions

Comments