maybe the key to training future LLMs is to write angry blog posts about the things they aren't good at and get them to the front page of HN?
Simple fix - probability cutoff. But in all seriousness, this is something that will be fixed; I don't see a fundamental reason why not.
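Roughly what I mean, as a toy sketch (the tiny vocabulary and the numbers are made up; real decoding scores logits over tens of thousands of tokens, and a low top-token probability is only a rough proxy for "the model is unsure"):

    import Foundation

    // Toy "probability cutoff": pick the most likely next token,
    // but abstain when even the best candidate is below the threshold.
    func nextToken(_ probabilities: [String: Double], cutoff: Double = 0.5) -> String {
        guard let best = probabilities.max(by: { $0.value < $1.value }),
              best.value >= cutoff else {
            return "<not sure>"
        }
        return best.key
    }

    print(nextToken(["LZFSE": 0.82, "zstd": 0.11, "LZ4": 0.07]))             // "LZFSE"
    print(nextToken(["COMPRESSION_ZSTD": 0.34, "COMPRESSION_LZFSE": 0.33]))  // "<not sure>"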
And I myself have seen such hallucinations (about compression too, actually).
The fundamental reason why it cannot be fixed is that the model does not know anything about reality; there is simply no such concept here.
To make a "probability cutoff" you first need a probability about what the reality/facts/truth is, and we have no such reliable and absolute data (and probably never will).
Or are you claiming, in general, that there is no objective truth in reality in the philosophical sense? Well, you can go down that more philosophical side of the road, or you can get more pragmatic: things just work, regardless of how we talk about them.
Yes, we do have reliable datasets as in your example, but those are for specific topics and are not based on natural language. What I would call "classical" machine learning is already a useful technology where it's applied.
Jumping from separate datasets focused on specific topics to a single dataset describing "everything" at once is not something we are even close to doing, if it's even possible. Hence the claim of having a single AI able to answer anything is unreasonable.
The second issue is that even if we had such a hypothetical dataset, ultimately if you want a formal response from it, you need a formal question and a formal language (probably something between maths and programming?) in all the steps of the workflow.
LLMs are only statistical models of natural language, so they are the antithesis of this very idea. Achieving that would require a completely different technology, one that has yet to even be theorized.
Can a human give a probability estimate to their predictions?
I'm precisely trying to criticize the claims of AGI and intelligence. English is not my native language, so nuances might be wrong.
I used the word "makes-up" in the sense of "builds" or "constructs" and did not mean any intelligence there.
I think a more correct take here might be "it's a tool that I don't trust enough to use without checking," or at the very least, "it's a useless tool for my purposes." I understand your point, but I got a little caught up on the above line because it's very far out of alignment with my own experience using it to save enormous amounts of time.
Think of LLMs as the less accurate version of scientific journals.
Famous last words. Checking trivial code for trivial bugs, yes. In science, you can have very subtle bugs that bias your results in ways that aren't obvious for a while until suddenly you find yourself retracting papers.
I've used LLMs to write tedious code (that should probably have been easier if the right API had been thought through), but when it comes to the important stuff, I'll probably always write an obviously correct version first and then let the LLM try to make a faster/more capable version, that I can check against the correct version.
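A minimal sketch of that check-against-the-reference workflow, with placeholder functions standing in for the real code (the random-input harness is the point, not the sums):

    import Foundation

    // The obviously correct version, written by hand first.
    func referenceSum(_ xs: [Int]) -> Int {
        var total = 0
        for x in xs { total += x }
        return total
    }

    // The LLM's "faster/more capable" candidate, checked against the reference.
    func candidateSum(_ xs: [Int]) -> Int {
        xs.reduce(0, +)
    }

    for _ in 0..<1_000 {
        let input = (0..<Int.random(in: 0...100)).map { _ in Int.random(in: -1_000...1_000) }
        precondition(referenceSum(input) == candidateSum(input),
                     "candidate disagrees with reference on \(input)")
    }
    print("1,000 random inputs agree")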
I only used an LLM for the first time recently, to rewrite a YouTube transcript into a recipe. It was excellent at the overall restructuring, but it made a crucial and subtle mistake. The recipe called for dividing 150g of sugar, adding 30g of cornstarch to one half, and blanching eggs in that mixture. ChatGPT rewrote it so that you blanched the eggs in the other half, without the cornstarch. This left me with a boiling custard that wasn’t setting up.
I did confirm that the YouTube transcript explicitly said to use the sugar and cornstarch mixture. But I didn’t do a side by side comparison because the whole reason for doing the rewrite is that transcripts are difficult to read!
I’m not usually so confident in my own infallibility, so I prefer to think of it as “I might get this wrong, the LLM might get this wrong, our failure modes are probably not very correlated, so the best thing is for us both to do it and compare.”
Agree it is always better for the human engineer to try writing the critical code first, since they are susceptible to being biased by seeing the LLM’s attempt. Whereas you can easily hide your solution from the LLM.
A whole lot of my schooling involved listening to teachers repeating over and over to us how we should check our work, because we can't even trust ourselves.
(heck, I had to double-check and fix typos in this comment)
We use tools because they work in ways that humans do not. Through centuries of building and using tools, we as a society have learned what makes a tool good versus bad.
Good tools are reliable. They have a clear purpose and ergonomic user interface. They are straightforward to use and transparent in how they operate.
LLMs are none of these things. It doesn’t matter that humans also are none of these things, if we are trying to use LLMs as tools.
The closest human invention resembling an LLM is the idea of a bureaucracy — LLMs are not good tools, they are not good humans, they are mindless automatons that stand in the way and lead you astray.
At best, LLMs are poor tools and also poor human replacements, which is why it’s so frustrating to me we are so intent on replacing good tools and humans with LLMs.
One day maybe we'll exceed human abilities, but it's unreasonable to expect early attempts - and these are still early attempts - to solve problems that we ourselves only manage by piling all kinds of complex process on top of very flawed human thinking.
Coding agents have now got pretty good at checking themselves against reality, at least for things where they can run unit tests or a compiler to surface errors. That would catch the error in TFA. Of course there is still more checking to do down the line, in code reviews etc, but that goes for humans too. (This is not to say that humans and LLMs should be treated the same here, but nor do I treat an intern’s code and a staff engineer’s code the same.) It’s a complex issue that we can’t really collapse into “LLMs are useless because they get things wrong sometimes.”
In one rearrangement, he got "Son sues father for xyz". That headline came true 2 years later.
> it’s “just a statistical model” that generates “language” based on a chain of “what is most likely to follow the previous phrase”
Humans are statistical models too in an appropriate sense. The question is whether we try to execute phrase by phrase or not, or whether it even matters what humans do in the long term.
> The only way ChatGPT will stop spreading that nonsense is if there is a significant mass of humans talking online about the lack of ZSTD support.
Or you can change the implicit bias in the model by being more clever with your training procedure. This is basic stats here, not everything is about data.
> They don’t know anything, they don’t think, they don’t learn, they don’t deduct. They generate real-looking text based on what is most likely based on the information it has been trained on.
This may be comforting to think, but it's just wrong. It would make my job so much easier if it were true. If you take the time to define "know", "think", and "deduct", you will find it difficult to argue current LLMs do not do these things. "Learn" is the exception here, and is a bit more complex, not only because of memory and bandwidth issues, but also because "understand" is difficult to define.
If in 2022 I’d tried to convince AI skeptics that in three years we might have tools on the level of Claude Code, I’m sure I’d have heard everyone say it would be impossible because “it’s just a statistical model.” But it turned out that there was a lot more potential in the architecture for encoding structured knowledge, complex reasoning, etc., despite that architecture being probabilistic. (Don’t bet against the Bitter Lesson.)
LLMs have a lot of problems, hallucination still being one of them. I’d be the first to advocate for a skeptical hype-free approach to deploying them in software engineering. But at this point we need careful informed engagement with where the models are at now rather than cherry-picked examples and rants.
And when what you usually work on actually is very simple and mostly mindless, you'd probably benefit more from doing it yourself, so you can progress above the junior stuff one day.
LLMs know so much (when you just use ChatGPT for the first time like it's an Oracle machine) -> LLMs don't know anything (when you understand how machine learning works) -> LLMs know so much (when you actually think about what 'know' means)
A theory gives 100% correct predictions, although the theory itself may not model the world accurately. That feedback between a theory and its application in the world drives iterations of the theory: from Newtonian mechanics to relativity, etc.
Long story short, the LLM is a long way away from any of this. And to be fair to LLMs, the average human is not creating theories; it takes some genius to create them (Newton, Turing, etc.).
Understanding something == knowing the theory of it.
What made you believe this is true? Like it or not, yes, they do (at least to the best extent of our definitions of what you've said). There is a big body of literature exploring this question, and the general consensus is that all performant deep learning models adopt an internal representation that can be extracted as a symbolic representation.
I have yet to see a theory coming out of an LLM that is sufficiently interesting. My comment was answering your question of what it means to "understand something". My answer to that is: understanding something is knowing the theory of it.
Now, that begs the question of what a theory is. And to answer that: a theory comprises building-block symbols and a set of rules to combine them. For example, building blocks for space (and geometry) could be points, lines, etc. The key point in all of this is symbolism as abstractions to represent things in some world.
> The key point in all of this is symbolism as abstractions to represent things in some world.
The difficulty is understanding how to extract this information from the model, since the output of the LLM is actually a very poor representation of its internal state.
Reason? Maybe. But there's one limitation that we currently have no idea how to overcome: LLMs don't know how much they know. If they tell you they don't know something, it may be a lie. If they tell you they do, that may be a lie too. I, a human, certainly know what I know and what I don't, and can recall where I know the information from.
yep. There are 2 processes underlying our behaviors.
1) a meme-copier which takes in information from various sources and regurgitates it. A rote-memory machine of sorts. Here, it is just cached memory that is populated and spit out. This has been termed "know-that"
2) a builder which attempts to construct a theory of something. Here, a mechanism is built and understood. This has been termed "know-how".
Problem is that we operate in the "know-that" territory most of the time, and have not taken the effort to build theories for ourselves in the know-how territory.
For example, if I use the language of my expertise for a familiar project then the boundaries where the challenges might lie are known. If I start learning a new language for the project I won't know which areas might produce unknowns.
The LLM will happily give you code in a language it's not trained well on. With the same confidence as using any other language.
> Mathematicians used to comb through model solutions because earlier systems would quietly flip an inequality or tuck in a wrong step, creating hallucinated answers.
> Brown says the updated IMO reasoning model now tends to say “I’m not sure” whenever it lacks a valid proof, which sharply cuts down on those hidden errors.
> TLDR, the model shows a clear shift away from hallucinations and toward reliable, self‑aware reasoning.
No, we aren't, and I'm getting tired of this question-begging and completely wrong statement. Human beings are capable of what Kant in fancy words called "transcendental apperception": we're already bringing our faculties to bear on experience, without which the world would make no sense to us.
What that means in practical terms for programming problems of this kind is that we can say "I don't know", which the LLM can't, because there's no "I" in the LLM, no unified subject that can distinguish what it knows and what it doesn't, what's within its domain of knowledge or outside it.
>If you take the time to define "know", "think", and "deduct", you will find it difficult to argue current LLMs do not do these things
No, you'd only make such a statement if you don't spend the time to think about what knowledge is. What enables knowledge, which is not raw data but synthesized, structured cognition, is the faculties of the mind: the a priori categories we bring to bear on data.
That's why these systems are about as useless as a monkey with a typewriter when you try to have them work on manual memory management in C, because that's less a task of autocompletion and more one that requires you to have a working model of the machine in your mind.
There is no such thing as consciousness in Dennett's theory; his position is that it doesn't exist, he is an eliminativist. This is of course an absurd position with no evidence for it, as people like Chalmers have pointed out (including in that Wikipedia article), and it might be the most comical and ideological position of the last 200 years.
If you can come up with a symbolic description of a deficiency in how LLMs approach problems, that's fantastic, because we can use that to alter how these models are trained, and how we approach problems too!
> What that means in practical terms for programming problems of this kind is that we can say "I don't know", which the LLM can't, because there's no "I" in the LLM, no unified subject that can distinguish what it knows and what it doesn't, what's within its domain of knowledge or outside it.
We seriously don't know whether there is an "I" that is comprehended or not. I've seen arguments either way. But otherwise, this seems to refer to poor internal calibration of uncertainty, correct? This is an important problem! (It's also a problem with humans too, but I digress.) LLMs aren't nearly as bad at this as you might think, and there are a lot of things you can do (that the big tech companies do not do) that can better tune its own self-confidence (as reflected in logits). I'm not aware of anything that uses this information as part of the context, so that might be a great idea. But on the other hand, maybe this actually isn't as important as we think it is.
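To make the "as reflected in logits" part concrete, here's a toy sketch of turning per-token log-probabilities into a crude confidence score; the logprob values are made up, and in practice you'd need a model or API that actually exposes them:

    import Foundation

    // Geometric-mean token probability, i.e. exp(mean logprob) -
    // the inverse of perplexity, usable as a rough confidence signal.
    func meanTokenProbability(_ logprobs: [Double]) -> Double {
        exp(logprobs.reduce(0, +) / Double(logprobs.count))
    }

    let confidentAnswer: [Double] = [-0.05, -0.10, -0.02, -0.08]  // hypothetical values
    let hedgedAnswer: [Double]    = [-1.90, -2.40, -0.70, -3.10]

    print(meanTokenProbability(confidentAnswer))  // ~0.94
    print(meanTokenProbability(hedgedAnswer))     // ~0.13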
Thanks Sonnet.
Full response:
https://www.perplexity.ai/search/without-adding-third-party-...
From a product point of view, it seems clear that just as they have work to get the model to dynamically decide to use reasoning when it would help, they have to do the same with web search.
I see that GitHub Copilot actually runs code, writes simple exploratory programs, and iteratively tests its hypotheses. It is astoundingly effective and fast.
Same here. Nothing stops the AI from actually trying to implement whatever it suggested, compiling it, and seeing if it actually works.
Grounding in reality at inference time, so to speak.
"Short answer: you can’t. iOS doesn’t ship a Zstandard (zstd) encoder/decoder in any first-party framework. Apple’s built-in Compression framework supports LZFSE, LZ4, zlib/deflate, and LZMA—not zstd."
For this use-case it's been very useful, it can usually generate close-to-complete solutions, as long as it's one of the major programming languages, and it's a reasonably standard problem. So in general I'm always surprised when people say that LLMs are completely useless for coding --- this is just not true, and I feel sorry for people who shut themselves off from a useful tool.
But even at this small scale, even the best (o3) models sometimes totally fail. Recently I started a series of posts on distributed algorithms [1], and when I was working on the post/code for the Byzantine Generals / Consensus algorithm, o3 --- to my honest surprise --- just totally failed. I tried about 10 different times (both from scratch and describing the incorrect behaviour of its code), also showing it the original Lamport paper, and it just couldn't get it right... even though the toy implementation is just ~100 LOC, the actual algorithm portion is maybe 25 LOC. My hypothesis is that there are very few implementations online, and additionally I find the descriptions of the algorithm a bit vague (interleaves message cascade and decision logic).
(It auto-routed to its thinking model, took 25s)
> Short answer: you can’t. iOS doesn’t ship a public zstd encoder/decoder. The system Compression and AppleArchive frameworks only expose lzfse, lz4, zlib/deflate, and lzma. So with only Apple SDKs, there’s no way to produce zstd-compressed bytes.
> Two practical paths:
> 1. If zstd isn’t mandatory, use a built-in codec (lzfse or lz4 are fastest) via Compression’s streaming API. Example (pure Swift, no deps) that compresses an arbitrary byte stream:
> <code example>
> 2. If zstd is mandatory, you must embed zstd yourself. That can mean:
> - Vendoring the official C sources (libzstd) directly into your project and calling them via a tiny Swift wrapper/bridging header
> - Statically linking a prebuilt libzstd.a you compile for iOS
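To make path 2 concrete, here is roughly what that "tiny Swift wrapper" might look like, assuming libzstd's sources (or a prebuilt libzstd.a) are already vendored into the project and zstd.h is visible through the bridging header; this is a sketch of the stock libzstd one-shot C API, nothing Apple-specific, and it assumes size_t bridges as Int (adjust casts if your setup imports it differently):

    import Foundation

    enum ZstdError: Error { case compressionFailed }

    // Thin wrapper over libzstd's one-shot API (zstd.h exposed via the bridging header).
    func zstdCompress(_ input: Data, level: Int32 = 3) throws -> Data {
        // Worst-case output size for this input.
        let capacity = ZSTD_compressBound(input.count)
        var output = Data(count: capacity)
        let written = output.withUnsafeMutableBytes { dst -> Int in
            input.withUnsafeBytes { src -> Int in
                ZSTD_compress(dst.baseAddress, capacity,
                              src.baseAddress, input.count,
                              level)
            }
        }
        // ZSTD_compress returns an error code on failure, otherwise the bytes written.
        guard ZSTD_isError(written) == 0 else { throw ZstdError.compressionFailed }
        output.removeSubrange(written..<output.count)
        return output
    }

Decompression is the mirror image with ZSTD_decompress, and streaming would use the ZSTD_CCtx/ZSTD_CStream APIs instead.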
> On iOS, you can use Apple’s built-in Zstandard (zstd) compression API from the Compression framework — no third-party dependencies required.
> Here’s how you can compress a Data stream with zstd:
> ...
https://chatgpt.com/share/68976c8f-7ae0-8012-b7a8-58e016246d...
(but, earlier)
If a tool is able to actively mislead me this easily, potentially resulting in me wasting significant amounts of time trying to make something work that is guaranteed never to work, it’s a useless tool. I don’t like collaborating with chronic liars.
Yeah, except it isn't. You can get enormous value out of LLMs if you get over this weird science fiction requirement that they never make mistakes.
And yeah, their confidence is frustrating. Treat them like an over-confident twenty-something intern who doesn't like to admit when they get stuff wrong.
You have to put the effort in to learn how to use them with a skeptical eye. I've been getting value as a developer from LLMs since the GPT-3 era, and those models sucked.
> The only way ChatGPT will stop spreading that nonsense is if there is a significant mass of humans talking online about the lack of ZSTD support.
We actually have a robust solution for this exact problem now: run the prompt through a coding agent of some sort (Claude Code, Codex CLI, Cursor etc) that has access to the Swift compiler.
That way it can write code with the hallucinated COMPRESSION_ZSTD thing in it, observe that it doesn't compile and iterate further to figure out what does work.
Or the simpler version of the above: LLM writes code. You try and compile it. You get an error message and you paste that back into the LLM and let it have another go. That's been the main way I've worked with LLMs for almost three years now.
I use AI for sure, but only on things I can easily verify are correct (run a test or some code), because I have had the AI give me functions in an API, with links to online documentation for those functions; the documentation exists, the function is not in it, and when called out, instead of doing a basic tool call, the AI will double down that it is correct and you, the human, are wrong. That would get an intern fired, but here you are standing on the intern's side.
I wrote a note about that here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#s...
> Don’t fall into the trap of anthropomorphizing LLMs and assuming that failures which would discredit a human should discredit the machine in the same way.
I was explicitly calling out this comment: an intern would get fired if, when explicitly called out, they not only didn't want to admit they were wrong but vehemently disagreed.
The interaction was “Implement X”. It gave an implementation, I responded “function y does not exist, use a different method”, and instead of following that instruction it gave me a link to the documentation for the library that it claims contains that function and told me I am wrong.
I said the documentation it linked does not contain that function and to do something different and yet it still refused to follow instructions and pushed back.
At that point I “fired” it and wrote the code myself.
I’m going to comment here about this, but it’s a follow-on to the other comment: this is exactly the workflow I was following. I had given it the compiler error and it blamed an environment issue; I confirmed the environment is as it claims it should be; it linked to documentation that doesn’t state what it claims is stated.
In a coding agent this would have been an endless feedback loop that eats millions of tokens.
This is the reason why I do not use coding agents, I can catch hallucinations and stop the feedback loop from ever happening in the first place without needing to watch an AI agent try to convince itself that it is correct and the compiler must be wrong.
Ok great! For those of us who aren't too lazy for it, LLMs are providing a lot of value right now.
This seems like a particularly harsh criterion; what would happen if I applied it to other tools?
- I used Typescript, but it missed a bug that crashed prod, so it is "absolute horseshit"
- I used Rust, but one of my developers added an unsafe block, so it's trash.
My main beef with the AI hype is that it's allowing a lot of idiots to significantly devalue our profession in a really noxious and irritating way to people that generally don't understand what we do but would like to pay us less or pay less of us. I'm annoyed at other software developers that don't seem to see how harmful this will be for us when the insane investment bubble bursts and AI becomes a lot more expensive to use. We will probably have lost a generation of junior developers who have become dependent on a suddenly expensive tool. And execs will just think the seniors need to pick up the slack. And expectations on AI will be a lot higher when the subscription is more like 200 or 2000 a month.
And that's just for coding! I'd be furious if I was an artist and generative AI was being trained on my portfolio to plagiarize my work. (Badly)
What I never see justified is why any of this is good for society. At best it lets billionaires save some money by getting rid of jobs, or vibe coders pretend they can build a product until they hit a wall where real understanding is necessary. If you follow the trail of who is supposed to benefit from these things, it's not many of us. If AI were to disappear today I don't think my life would be any worse.
I love posts like these because it just reinforces that I made the right decision in spending as much time as I do in getting really really really good at using LLMs.
When ChatGPT 4 comes out, new versions of APIs will have fewer blog posts / examples / documentation in its training data. So ChatGPT 5 comes out and seems to solve all the problems that ChatGPT 4 had, but then of course fails on newer libraries. Rinse and repeat.
This means there is a future where AI is training on data it generated itself, and I worry that might not be sustainable.
Because the second seems vaguely impossible to do.