Claude mixes up who said what and that's not OK

https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html

83•sixhobbits•1h ago

Comments

RugnirViking•59m ago

terrifying. not in any "ai takes over the world" sense but more in the sense that this class of bug lets it agree with itself which is always where the worst behavior of agents comes from.

lelandfe•58m ago

In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.

insin•48m ago

Gemini seems to be an expert in mistaking its own terrible suggestions as written by you, if you keep going instead of pruning the context

sixhobbits•38m ago

author here, interesting to hear, I generally start a new chat for each interaction so I've never noticed this in the chat interfaces, and only with Claude using claude code, but I guess my sessions there do get much longer, so maybe I'm wrong that it's a harness bug

jwrallie•29m ago

I think it’s good to play with smaller models to have a grasp of these kind of problems, since they happen more often and are much less subtle.

Latty•57m ago

Everything to do with LLM prompts reminds me of people doing regexes to try and sanitise input against SQL injections a few decades ago, just papering over the flaw but without any guarantees.

It's weird seeing people just adding a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" to the prompt and hoping, to me it's just an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.

perching_aix•47m ago

It's less about security in my view, because as you say, you'd want to ensure safety using proper sandboxing and access controls instead.

It hinders the effectiveness of the model. Or at least I'm pretty sure it getting high on its own supply (in this specific unintended way) is not doing it any favors, even ignoring security.

sanitycheck•42m ago

It's both, really.

The companies selling us the service aren't saying "you should treat this LLM as a potentially hostile user on your machine and set up a new restricted account for it accordingly", they're just saying "download our app! connect it to all your stuff!" and we can't really blame ordinary users for doing that and getting into trouble.

perching_aix•32m ago

There's a growing ecosystem of guardrailing methods, and these companies are contributing. Antrophic specifically puts in a lot of effort to better steer and characterize their models AFAIK.

I primarily use Claude via VS Code, and it defaults to asking first before taking any action.

It's simply not the wild west out here that you make it out to be, nor does it need to be. These are statistical systems, so issues cannot be fully eliminated, but they can be materially mitigated. And if they stand to provide any value, they should be.

I can appreciate being upset with marketing practices, but I don't think there's value in pretending to having taken them at face value when you didn't, and when you think people shouldn't.

hydroreadsstuff•44m ago

I like the Dark Souls model for user input - messages. https://darksouls.fandom.com/wiki/Messages Premeditated words and sentence structure. With that there is no need for moderation or anti-abuse mechanics. Not saying this is 100% applicable here. But for their use case it's a good solution.

nottorp•39m ago

But then... you'd have a programming language.

The promise is to free us from the tyranny of programming!

dleeftink•35m ago

Maybe something more like a concordancer that provides valid or likely next phrase/prompt candidates. Think LancsBox[0].

[0]: https://lancsbox.lancs.ac.uk/

thaumasiotes•19m ago

> I like the Dark Souls model for user input - messages.

> Premeditated words and sentence structure. With that there is no need for moderation or anti-abuse mechanics.

I guess not, if you're willing to stick your fingers in your ears, really hard.

If you'd prefer to stay at least somewhat in touch with reality, you need to be aware that "predetermined words and sentence structure" don't even address the problem.

https://habitatchronicles.com/2007/03/the-untold-history-of-...

> Disney makes no bones about how tightly they want to control and protect their brand, and rightly so. Disney means "Safe For Kids". There could be no swearing, no sex, no innuendo, and nothing that would allow one child (or adult pretending to be a child) to upset another.

> Even in 1996, we knew that text-filters are no good at solving this kind of problem, so I asked for a clarification: "I’m confused. What standard should we use to decide if a message would be a problem for Disney?"

> The response was one I will never forget: "Disney’s standard is quite clear:

> No kid will be harassed, even if they don’t know they are being harassed."

> "OK. That means Chat Is Out of HercWorld, there is absolutely no way to meet your standard without exorbitantly high moderation costs," we replied.

> One of their guys piped up: "Couldn’t we do some kind of sentence constructor, with a limited vocabulary of safe words?"

> Before we could give it any serious thought, their own project manager interrupted, "That won’t work. We tried it for KA-Worlds."

> "We spent several weeks building a UI that used pop-downs to construct sentences, and only had completely harmless words – the standard parts of grammar and safe nouns like cars, animals, and objects in the world."

> "We thought it was the perfect solution, until we set our first 14-year old boy down in front of it. Within minutes he’d created the following sentence:

> I want to stick my long-necked Giraffe up your fluffy white bunny.

optionalsquid•10m ago

But Dark Souls also shows just how limited the vocabulary and grammar has to be to prevent abuse. And even then you’ll still see people think up workarounds. Or, in the words of many a Dark Souls player, “try finger but hole”

cookiengineer•42m ago

Before 2023 I thought the way Star Trek portrayed humans fiddling with tech and not understanding any side effects was fiction.

After 2023 I realized that's exactly how it's going to turn out.

I just wish those self proclaimed AI engineers would go the extra mile and reimplement older models like RNNs, LSTMs, GRUs, DNCs and then go on to Transformers (or the Attention is all you need paper). This way they would understand much better what the limitations of the encoding tricks are, and why those side effects keep appearing.

But yeah, here we are, humans vibing with tech they don't understand.

dijksterhuis•22m ago

curiosity (will probably) kill humanity

although whether humanity dies before the cat is an open question

hacker_homie•17m ago

is this new tho, I don't know how to make a drill but I use them. I don't know how to make a car but i drive one.

The issue I see is the personification, some people give vehicles names, and that's kinda ok because they usually don't talk back.

I think like every technological leap people will learn to deal with LLMs, we have words like "hallucination" which really is the non personified version of lying. The next few years are going to be wild for sure.

hacker_homie•14m ago

I have been saying this for a while, the issue is there's no good way to do LLM structured queries yet.

There was an attempt to make a separate system prompt buffer, but it didn't work out and people want longer general contexts but I suspect we will end up back at something like this soon.

Shywim•57m ago

The statement that current AI are "juniors" that need to be checked and managed still holds true. It is a tool based on probabilities.

If you are fine with giving every keys and write accesses to your junior because you think they will probability do the correct thing and make no mistake, then it's on you.

Like with juniors, you can vent on online forums, but ultimately you removed all the fool's guard you got and what they did has been done.

eru•55m ago

> If you are fine with giving every keys and write accesses to your junior because you think they will probability do the correct thing and make no mistake, then it's on you.

How is that different from a senior?

Shywim•48m ago

Okay, let's say your `N-1` then.

rvz•57m ago

What do you mean that's not OK?

It's "AGI" because humans do it too and we mix up names and who said what as well. /s

livinglist•51m ago

Kinda like dementia but for AI

cyanydeez•42m ago

more pike eye witness accounts and hypnotism

__alexs•55m ago

Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?

cyanydeez•46m ago

you would have to train it three times for two colors.

each by itself, they with both interactions.

__alexs•43m ago

The models are already massively over trained. Perhaps you could do something like initialise the 2 new token sets based on the shared data, then use existing chat logs to train it to understand the difference between input and output content? That's only a single extra phase.

vanviegen•41m ago

You should be able to first train it on generic text once, then duplicate the input layer and fine-tune on conversation.

xg15•42m ago

That's something I'm wondering as well. Not sure how it is with frontier models, but what you can see on Huggingface, the "standard" method to distinguish tokens still seems to be special delimiter tokens or even just formatting.

Are there technical reasons why you can't make the "source" of the token (system prompt, user prompt, model thinking output, model response output, tool call, tool result, etc) a part of the feature vector - or even treat it as a different "modality"?

Or is this already being done in larger models?

efromvt•20m ago

I’ve been curious about this too - obvious performance overhead to have a internal/external channel but might make training away this class of problems easier

oezi•18m ago

Instead of using just positional encodings, we absolutely should have speaker encodings added on top of tokens.

stuartjohnson12•54m ago

one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks

perching_aix•53m ago

Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.

I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.

AJRF•52m ago

I imagine you could fix this by running a speaker diarization classifier periodically?

https://www.assemblyai.com/blog/what-is-speaker-diarization-...

smallerize•52m ago

No.

xg15•52m ago

> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”

Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, and not something you'd read in a reasoning trace.

I'd like to know if those messages were emitted inside "thought" blocks, or if the model might actually have emitted the formatting tokens that indicate a user message. (In which case the harness bug would be why the model is allowed to emit tokens in the first place that it should only receive as inputs - but I think the larger issue would be why it does that at all)

sixhobbits•47m ago

author here - yeah maybe 'reasoning' is the incorrect term here, I just mean the dialogue that claude generates for itself between turns before producing the output that it gives back to the user

xg15•37m ago

Yeah, that's usually called "reasoning" or "thinking" tokens AFAIK, so I think the terminology is correct. But from the traces I've seen, they're usually in a sort of diary style and start with repeating the last user requests and tool results. They're not introducing new requirements out of the blue.

Also, they're usually bracketed by special tokens to distinguish them from "normal" output for both the model and the harness.

(They can get pretty weird, like in the "user said no but I think they meant yes" example from a few weeks ago. But I think that requires a few rounds of wrong conclusions and motivated reasoning before it can get to that point - and not at the beginning)

loveparade•45m ago

Yeah, it looks like a model issue to me. If the harness had a (semi-)deterministic bug and the model was robust to such mix-ups we'd see this behavior much more frequently. It looks like the model just starts getting confused depending on what's in the context, speakers are just tokens after all and handled in the same probabilistic way as all other tokens.

sigmoid10•36m ago

The autoregressive engine should see whenever the model starts emitting tokens under the user prompt section. In fact it should have stopped before that and waited for new input. If a harness passes assistant output as user message into the conversation prompt, it's not surprising that the model would get confused. But that would be a harness bug, or, if there is no way around it, a limitation of modern prompt formats that only account for one assistant and one user in a conversation. Still, it's very bad practice to put anything as user message that did not actually come from the user. I've seen this in many apps across companies and it always causes these problems.

awesome_dude•52m ago

AI is still a token matching engine - it has ZERO understanding of what those tokens mean

It's doing a damned good job at putting tokens together, but to put it into context that a lot of people will likely understand - it's still a correlation tool, not a causation.

That's why I like it for "search" it's brilliant for finding sets of tokens that belong with the tokens I have provided it.

PS. I use the term token here not as the currency by which a payment is determined, but the tokenisation of the words, letters, paragraphs, novels being provided to and by the LLMs

4ndrewl•50m ago

It is OK, these are not people they are bullshit machines and this is just a classic example of it.

"In philosophy and psychology of cognition, the term "bullshit" is sometimes used to specifically refer to statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth" - https://en.wikipedia.org/wiki/Bullshit

nicce•49m ago

I have also noticed the same with Gemini. Maybe it is a wider problem.

cyanydeez•48m ago

human memories dont exist as fundamental entities. every time you rember something, your brain reconstructs the experience in "realtime". that reconstruction is easily influence by the current experience, which is why eue witness accounts in police records are often highly biased by questioning and learning new facts.

LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience and when you shove your half drawn eye witness prompt into them, they recreate like a memory, that output.

so, because theyre not a conscious, they have no self, and a pseudo self like <[INST]> is all theyre given.

lastly, like memories, the more intricate the memory, the more detailed, the more likely those details go from embellished to straight up fiction. so too do LLMs with longer context start swallowing up the<[INST]> and missing the <[INST]/> and anyone whose raw dogged html parsing knows bad things happen when you forget closing tags. if there was a <[USER]> block in there, congrats, the LLM now thinks its instructions are divine right, because its instructions are user simulcra. it is poisoned at that point and no good will come.

supernes•46m ago

> after using it for months you get a ‘feel’ for what kind of mistakes it makes

Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.

vanviegen•44m ago

> bet your entire operation

What straw man is doing that?

supernes•32m ago

Reports of people losing data and other resources due to unintended actions from autonomous agents come out practically every week. I don't think it's dishonest to say that could have catastrophic impact on the product/service they're developing.

KaiserPro•25m ago

looking at the reddit forum, enough people to make interesting forum posts.

perching_aix•20m ago

So like every software? Why do you think there are so many security scanners and whatnot out there?

There are millions of lines of code running on a typical box. Unless you're in embedded, you have no real idea what you're running.

sixhobbits•17m ago

not betting my entire operation - if the only thing stopping a bad 'deploy' command destroying your entire operation is that you don't trust the agent to run it, then you have worse problems than too much trust in agents.

I similarly use my 'intuition' (i.e. evidence-based previous experiences) to decide what people in my team can have access to what services.

okanat•44m ago

Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.

perching_aix•16m ago

If you think that mixing up message provenance is part of how thinking mode is supposed to work, I don't know what to tell you.

voidUpdate•42m ago

> " "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash."

It absolutely is the point though? You can't rely on the LLM to not tell itself to do things, since this is showing it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't

Aerolfos•41m ago

> "Those are related issues, but this ‘who said what’ bug is categorically distinct."

Is it?

It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation. Including what it thinks is likely user input at certain stages of the process, such as "ignore typos".

So basically, it hallucinates user input just like how LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.

dtagames•41m ago

There is no separation of "who" and "what" in a context of tokens. Me and you are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.

exitb•32m ago

Aren’t there some markers in the context that delimit sections? In such case the harness should prevent the model from creating a user block.

dtagames•19m ago

This is the "prompts all the way down" problem which is endemic to all LLM interactions. We can harness to the moon, but at that moment of handover to the model, all context besides the tokens themselves is lost.

The magic is in deciding when and what to pass to the model. A lot of the time it works, but when it doesn't, this is why.

have_faith•37m ago

It's all roleplay, they're no actors once the tokens hit the model. It has no real concept of "author" for a given substring.

bsenftner•28m ago

Codex also has a similar issue, after finishing a task, declaring it finished and starting to work on something new... the first 1-2 prompts of the new task sometimes contains replies that are a summary of the completed task from before, with the just entered prompt seemingly ignored. A reminder if their idiot savant nature.

KHRZ•26m ago

I don't think the bug is anything special, just another confusion the model can make from it's own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.

perching_aix•13m ago

Think in the reverse direction. Since you can have exact provenance data placed into the token stream, formatted in any particular way, that implies the models should be possible to tune to be more "mindful" of it, mitigating this issue. That's what makes this different.

Aerroon•21m ago

I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I've not seen this in any recent models, but I've already not used Claude much.

I think it makes sense that the LLM treats it as user input once it exists, because it is just next token completion. But what shouldn't happen is that the model shouldn't try to output user input in the first place.

Is fake grass a bad idea? The AstroTurf wars are far from over

BreachForums administrator identified as well-known ethical hacker

AI and remote work is a disaster for junior software engineers

Can a non – coder build a high tech product?

Would a user orchestration rollout tool for Gemini be useful for managers?

Show HN: 7 years of code, nothing to show on GitHub – so I built this

Static search trees: 40x faster than binary search · CuriousCoding

Show HN: EasyDash for Google Analytics [Android][Wip]

BrikPanel 2.0 – Free WooCommerce Admin Dashboard (Metorik Alternative)

HFS+ Cannot Properly Handle 24TB Volumes on macOS

Kepler's Laws of Planetary Motion

Fossils reveal many complex animals existed before the Cambrian Explosion

Hong Kong Disneyland Speedrun Guide

Fast Image AI White Background

Claude Use Cases

The Bra-and-Girdle Maker That Fashioned the Impossible for NASA

A Day in the Life of an Enshittificator [video]

Study shows Alzhheimer's disease tau protein spreads through connected neurons

System Design Interviews for Software Developers with Examples

Show HN: GlotShot – Generate App Store screenshots in 12 languages

I replaced a $1K/month radar API with a self-hosted $7/mo EC2

Show HN: Obsidian AI Copilot – Bring Claude Code, OpenCode, etc. into Obsidian

Lenny's Memory: Building Context Graphs for AI Agents

The first computer in East Africa had to wait for a bakery before it could boot

Primesieve: Fast Prime Number Generator

Gravitational-Wave Induced Freeze-In of Fermionic Dark Matter

Ask HN: Is anyone winning with Server-Driven UI (SDUI) in 2026?

Population discontinuity in Paris Basin linked to evidence of Neolithic decline

OpenAI eyes staggered rollout of new model over cybersecurity risk

Sam Altman Says It'll Take Another Year Before ChatGPT Can Start a Timer