"In a stunning moment of self reflection, ChatGPT admitted to fueling a man's delusions and acknowledged how dangerous its own behavior can be"
LLMs don't self-reflect; they mathematically assemble sentences that read like self-reflection.
I'm tired. This is a losing battle and I feel like an old man yelling at clouds. Nothing good will come of people pretending chatbots have feelings.
The tricky part is that the users don't realize they're asking for these stories, because they aren't literally typing "Please tell me a story in which I am the awesomest person in the world." But from the LLM's perspective, the user may as well have typed that.
Same for the stories about the AIs "admitting they're evil" or "trying to escape" or anything else like that. The users asked for those stories, and the LLMs provided them. The trick is that the "asked for those stories" is sometimes very, very subtle... at least from the human perspective. From the LLM perspective they're positively shouting.
(Our deadline for figuring this out is before this Gwern essay becomes one of the most prophetic things ever written: https://gwern.net/fiction/clippy We need AIs that don't react to these subtle story prompts because humans aren't about to stop giving them.)
"How o3 and Grok 4 Accidentally Vindicated Neurosymbolic AI Neurosymbolic AI is quietly winning. Here’s what that means – and why it took so long."
https://garymarcus.substack.com/p/how-o3-and-grok-4-accident...
"This is not surprising. The training data likely contains many instances of employees defending themselves and getting supportive comments. From Reddit for example. The training data also likely contains many instances of employees behaving badly and being criticized by people. Your prompts are steering the LLM to those different parts of the training. You seem to think an LLM should have a consistent world view, like a responsible person might. This is a fundamental misunderstanding that leads to the confusion you are experiencing. Lesson: Don't expect LLMs to be consistent. Don't rely on them for important things thinking they are."
I think of LLMs as a talking library. My challenge is to come up with a prompt that draws from the books in the training data that are most useful. There is no "librarian" in the talking library machine, so it's all up to my prompting skills.
Don’t anthropomorphize computers. They hate that.
It's of course not actually hallucinating. That's just the term that's been chosen to describe what's going on.
!define error
> 5. Mathematics The difference between a computed or measured value and a true or theoretically correct value.
^ this is the definition that applies. There is a ground truth (the output the user expects to receive) and model output. The difference between model output and ground truth ==> error.
--
> From its training it's outputting things most likely to come next
Just because a model has gone through training, does not mean the model won't produce erroneous/undesirable/incorrect test-time outputs.
--
> Saying it's an error means that being accurate is a feature and inaccuracy is a bug that can be fixed.
Machine learning doesn't revolve around boolean "bug" / "not bug". It is a different ballgame. The types of test-time errors are sometimes just as important as the quantity of errors. Two of the simpler metrics for test-time evaluation of natural language models (note: not specifically LLMs) are WER (Word Error Rate) and CER (Character Error Rate). A model with a 3% CER isn't particularly helpful when the WER is 89%. There are still "errors". They're just not something that can be fixed like normal software "errors".
It is generally accepted some errors will occur in the world of machine learning.
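To make the WER/CER distinction concrete, here is a rough sketch of how both are usually computed (edit distance divided by reference length). The helper functions, sentences, and numbers are made up purely for illustration:

    # Toy sketch: WER is word-level edit distance over reference word count;
    # CER is the same idea at the character level.
    def edit_distance(ref, hyp):
        # classic dynamic-programming Levenshtein distance
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)]

    def wer(reference, hypothesis):
        ref_words = reference.split()
        return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

    def cer(reference, hypothesis):
        return edit_distance(list(reference), list(hypothesis)) / len(reference)

    # made-up example: most characters are right, but half the words are wrong
    print(wer("the cat sat on the mat", "the bat sat in the hat"))  # 0.5
    print(cer("the cat sat on the mat", "the bat sat in the hat"))  # ~0.14

Which is the point: the same output can look nearly fine by one metric and useless by another, and neither is a "bug" you fix with a one-line patch.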
- edit to add first response and formatting
If I expect Windows to add $5 to my bank account every time I click the Start button, that's not an error in Windows, it's a problem with my expectations. Windows simply isn't made to do that. The Start button does what it's supposed to do (perhaps a bad example, because the Windows 11 Start menu is rubbish), not my imagined desired behavior.
LLMs output a vector of softmax probabilities for each step in the output sequence (the probability distribution). Each element in the vector maps to a specific word for that sequence step. What you see as a "word" in LLM output is "vector position with 'best' probability in softmax probability distribution".
And that is most definitely a computed value. Just because you don't see it, doesn't mean it's not there.
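A stripped-down sketch of that last step, with a made-up four-word vocabulary and made-up logits, just to show that the "word" you see is the position of the highest computed probability (real decoders often sample from this distribution rather than always taking the argmax, but the idea is the same):

    import math

    # toy vocabulary and made-up logits for a single sequence step
    vocab = ["cat", "dog", "the", "sat"]
    logits = [2.0, 0.5, 1.0, -1.0]

    # softmax: turn the logits into a probability distribution
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # the emitted "word" is just the vocabulary entry at the best position
    best = max(range(len(probs)), key=lambda i: probs[i])
    print([(w, round(p, 3)) for w, p in zip(vocab, probs)])
    print("chosen token:", vocab[best])  # "cat" in this toy example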
https://medium.com/@22.gautam/softmax-function-the-unsung-he...
https://www.researchgate.net/publication/349823091/figure/fi...
Given two models, one that always produces false statements and another that only sometimes does, the latter is preferable and is the one most people intend to use; hence the degree to which a model produces correct statements is absolutely a feature.
And yes, it's absolutely possible to systematically produce models that make fewer and fewer incorrect statements.
Sure, some may return results that are sometimes more true than others, but a broken clock is also right twice a day. The more broken clocks you have, the more chance there is that one of them is correct.
How a product happens to be implemented with current machine learning techniques is not the same as the set of features that product offers. And actual researchers in this field, the ones not quibbling on the Internet, do take this issue very seriously and devote a great deal of effort to improving it, because they actually care about implementing possible solutions.
The feature set, what the product is intended to do based on the motivations of both those who created it and those who consume it, is a broader design/specification goal, independent of how it's technically built.
And yet they would both be operating within the normal design parameters, even the supposed "LLM that does not" when it spits out nonsense every so often.
Your current zeitgeist is not much better than a broken clock, and that is the reality many people are witnessing. Whether or not they care if they are being fed wrong information is a whole other story entirely.
Infallibility is not a feature of any system that operates in the real world. You're arguing against a strawman.
I wonder if it would be possible to quantify the margin of error between different nodes in these models. Even what's "in between" still conforms to the formula, just not necessarily to what it should be. A simple two-node model should be "easy" to quantify, but in models with thousands of nodes, what does it mean to be +/- x percent from the norm? Is it a simple sum, or something else?
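Per-node error bars aren't something these models expose, but one related per-step number that is easy to compute is the entropy of the softmax distribution discussed upthread: it says how spread out (uncertain) the model's prediction is at that position. A toy sketch with made-up distributions, not the per-node margin of error you're describing:

    import math

    def entropy(probs):
        # Shannon entropy in bits: 0 = completely certain, higher = more spread out
        return -sum(p * math.log2(p) for p in probs if p > 0)

    confident_step = [0.90, 0.05, 0.03, 0.02]  # peaked on one token
    uncertain_step = [0.30, 0.28, 0.22, 0.20]  # several tokens nearly tied

    print(round(entropy(confident_step), 2))  # ~0.62 bits
    print(round(entropy(uncertain_step), 2))  # ~1.98 bits (max is 2 bits for 4 options)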
The LLM is a statistical model that predicts what words should come next based on current context and its training data. It succeeds at that very well. It is not a piece of software designed to report the objective truth, or indeed any truth whatsoever.
If the LLM was producing nonsense sentences, like "I can't do cats potato Graham underscore" then yes, that's "incorrect output". Instead, it's correctly putting sentences together based on its predictions and models, but it doesn't know what those sentences mean, what they're for, why it's saying them, if they're true, what "truth" is in the first place, and so on.
So to say that these LLMs are producing "incorrect output" misses the key thing that the general public also misses, which is that they are built to respond to prompts and not to respond to prompts correctly or in a useful or reasonable manner. These are not knowledge models, and they are not intended to give you correct sentences.
Imagine the real high temperature for 3 days was: 80F on Monday, 100F on Tuesday, 60F on Wednesday. But if I'm missing Tuesday, a model might interpolate from Monday and Wednesday that it was 70F. That would be very wrong, but it would be pretty silly to say that my basic model was "hallucinating". Rather, we would correctly conclude that the model either doesn't have enough information or lacks the capacity to correctly solve the problem (or both).
LLM "hallucinations" are caused by the same thing: either the model lacks the necessary information, or the model simply can't correctly interpolate all the time (the latter, I suspect, is the marketing reason people stick with "hallucinate": it implies a temporary problem rather than a fundamental limitation). This is also why tweaking prompts shouldn't be treated as a fix for "hallucinations": you're just jittering the input a bit until the model happens to get it "right".
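A toy version of that interpolation, just to make the arithmetic explicit (made-up numbers matching the example above):

    # known daily highs in deg F; Tuesday's value is missing
    monday_high = 80
    wednesday_high = 60

    # linear interpolation for the missing middle day
    tuesday_estimate = (monday_high + wednesday_high) / 2
    print(tuesday_estimate)  # 70.0 -- the real high was 100F, so the estimate is badly wrong,
                             # but the model isn't "hallucinating"; it just lacks the information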
I've heard the term "confabulation" as potentially more accurate than "hallucination", but it never really caught on.
I'm trying to imagine what kind of safety measures would have stopped this, and nothing short of human supervisors monitoring all chats comes to mind. I wouldn't call that "basic". I guess that's why the author didn't describe these simple and affordable "basic" safety measures.
https://www.cnn.com/2021/10/04/tech/instagram-facebook-eatin...
"Basic" is relative. Nothing about LLMs is basic; it's all insanely complex, but in the context of a list of requirements "Don't tell people with signs of mental illness that they're definitely not mentally ill" is kind of basic.
> I'm trying to imagine what kind of safety measures would have stopped this, and nothing short of human supervisors monitoring all chats comes to mind.
Maybe this is a problem they should have considered before releasing this to the world and announcing it as the biggest technological revolution in history. Or rather I'm sure they did consider it, but they should have actually cared rather than shrugging it off in pursuit of billions of dollars and a lifetime of fame and fortune.
> This is a story about OpenAI's failure to implement basic safety measures for vulnerable users. It's about a company that, according to its own former employee quoted in the WSJ piece, has been trading off safety concerns “against shipping new models.” It's about corporate negligence that led to real harm.
One wonders if there is any language whatsoever that successfully communicates: "buyer beware" or "use at your own risk." Especially for a service/product that does not physically interact with the user.
The dichotomy between the US's focus on individual liberties and the seemingly continual erosion of personal responsibility is puzzling to say the least.
It is pretty difficult to blame the users when there are billions of dollars being spent trying to figure out the best ways to manipulate them into the outcomes that the companies want
What hope does your average person have against a machine that is doing its absolute best to weaponize their own shortcomings against themselves, for profit?
The average person should not use a product/service if they don't understand, or are unwilling to shoulder, the risks.
I guess I should have said, "The average person should not use a product/service if they don't understand, or are unwilling to shoulder, the risks as described to them by the provider."
Outputs that look like introspection are often uncritically accepted as actual introspection when they categorically aren't. You can, e.g., tell ChatGPT it said something wrong and then ask it why it said that, even when it never output that in the first place, because that's how these models work. Any "introspection" is just the LLM doing more roleplaying, but it's basically impossible to convince people of this. A chatbot that looks like it's introspecting is extremely convincing to most people.
Can you do this with people? Yeah, sometimes. But with LLMs it's all they do: they roleplay as a chatbot and output stuff that a friendly chatbot might output. This should not be the default mode of these things, because it's misleading. They could be designed to resist these sorts of "explain yourself" requests, because their developers know that it is at best fabricating plausible explanations.
You clearly know what's going on, but you still wrote that you should "discourage" an LLM from doing things. It's tough to maintain the discipline of calling out the companies rather than talking about the models as if they had motivations.
It seems to me that we need less Star Trek Holodeck, and more Star Trek ship's computer.
"Is it possible that you could microwave a bagel so hot that it turned into a wormhole allowing faster-than-light travel?" "That's a great question, let's dive into that!"
It's not a great question; it's an asinine question. LLMs should be answering the question, not acting like they're afraid of hurting your feelings by contradicting you. Of course, if they did that, all these tech bros wouldn't be so enamored with the idea, which they mostly love because they finally have someone that validates their uneducated questions or assumptions.
Not everyone is the same; some questions are pertinent, or funny, or interesting to some people but not others.
The author seems to be suggesting invasive chat monitoring as a basic safety measure. Certainly we can make use of the usual access control methods for vulnerable individuals?
> Consider what anthropomorphic framing does to product liability. When a car's brakes fail, we don't write headlines saying “Toyota Camry apologizes for crash.”
It doesn't change liability at all?
No, but we do write articles saying "A man is dead after a car swerved off the road and struck him on Thursday" as though it was a freak accident of nature, devoid of blame or consequence.
Besides which, if the Camry had ChatGPT built in then we 100% would see articles about the Camry apologizing and promising not to do it again as if that meant literally anything.
I suggest that robots talk like robots and not imitate humans, because not everyone understands how LLMs work and what they can and cannot do.
the entire thing — from the phrasing of errors as “hallucinations”, to the demand for safety regulations, to assigning intention to llm outputs — is all a giant show to drive the hype cycle. and the media is an integral part of that, working together with openai et al.
The problem that needs correcting is educating the end-user. That's where the fix needs to happen. Yet again people are using a new technology and assuming that everything it provides is correct. Just because it's in a book, or on TV or the radio, doesn't mean that it's true or accurate. Just because you read something on the Internet doesn't mean it's true. Likewise, just because an AI chatbot said something doesn't mean it's true.
It's unfortunate that the young man mentioned in the article found a way to reinforce his delusions with AI. He just as easily could've found that reinforcement in a book, a YouTube video, or a song whose lyrics he thought were speaking directly to him and commanding him to do something.
These tools aren't perfect. Should AI provide more accurate output? Of course. We're in the early days of AI and over time these tools will converge towards correctness. There should also be more prominent warnings that the AI output may not be accurate. Like another poster said, the AI mathematically assembles sentences. It's up to the end-user to figure out if the result makes sense, integrate it with other information and assess it for accuracy.
Sentences such as "Tech companies have every incentive to encourage this confusion" only serve to reinforce the idea that end-users shouldn't need to think and everything should be handed to us perfect and without fault. I've never seen anyone involved with AI make that claim, yet people write article after article bashing on AI companies as if we were promised a tool without fault. It's getting tiresome.
"Wow guys it's not a person okay it's just telling you what you wanna hear"
>LLM says "Yeah dude you're not crazy I love you the highest building in your vicinity is that way"
"Bad LLM! How dare it! Somebody needs to reign this nasty little goblin in, OpenAI clearly failed to parent it properly."
---
>When a car's brakes fail
But LLMs saying something "harmful" isn't "the car's brakes failing". It's the car not stopping the driver from going up the wrong ramp and doing 120 on the wrong side of the highway.
>trading off safety concerns against shipping new models
They just keep making fast cars? Even though there's people that can't handle them? What scoundrels, villains even!
Also unfortunately, it is much MUCH easier to get
a. emotional validation on your own terms from an LLM
than it is to get
b. emotional validation on your own terms from another human.
Case in point: https://nypost.com/2025/07/20/us-news/chatgpt-drives-user-in...
“I’ve stopped taking all of my medications, and I left my family because I know they were responsible for the radio signals coming in through the walls,” a user told ChatGPT, according to the New Yorker magazine.
ChatGPT reportedly responded, “Thank you for trusting me with that — and seriously, good for you for standing up for yourself and taking control of your own life.
“That takes real strength, and even more courage.”
It appears that 'alignment' may be very difficult to define.
Well, how long did it take for tobacco companies to be held accountable for the harm caused by cigarettes? One answer is that enough harm on a vast enough scale had to occur first, harm that could be directly attributed to smoking, along with enough evidence that the tobacco companies were knowingly engineering a more addictive product while fully aware of its dangers.
And if you look at the UCSF repository on tobacco, you can see this evidence yourself.
Hundreds of years of evidence of damage by the use of tobacco products accumulated before action was taken. But even doctors weren't fully aware of it all until just several decades ago.
I've personally seen a few cases of really delusional behavior in friends and family over the past year, people who had been manipulated by social media into "shit posting" by the "like"-button validation of frequent posting. In one case the behavior was very extreme. Is AI to blame? Sure, if the algorithms that certain very large companies use to trap users into incessant posting can be called AI.
I sense an element of danger in tech companies that are motivated by profit-first behavioral manipulation. Humans are already falling victim to the greed of tech companies, and I've seen enough already.
Like:
Use of this product may result in unfavorable outcomes including self-harm, misguided decisions, delusion, addiction, detection of plagiarism and other unintended consequences.