As one example, we are killing thousands on the road just to be sure we can blame a driver instead of a computer.
Of course there is valuable knowledge in understanding limitations, but that is not the approach the author is taking here; imo the author seems disingenuous.
I use LLMs for language-related work (translations, grammatical explanations, etc.) and they are top notch at that, as long as you do not ask for references to particular grammar rules. In that case they will invent non-existent references.
They are also good for tutor personas: give me jj/git/emacs commands for this situation.
But they are bad in other cases.
I started scanning books recently and wanted to crop the random stuff outside an orange sheet of paper on which the book was placed before I handed the images over to ScanTailor Advanced (STA can do this, but I wanted to keep the original images around instead of the low-quality STA version). I spent 3-5 hours with Gemini 2.5 Pro (AI Studio) trying to get it to give me a series of steps (and finally a shell script) to get this working.
And it could not do it. It mixed up GraphicsMagick and ImageMagick commands. It failed even with libvips. Finally I asked it to provide a simple shell script where I would provide four pixel distances to crop from the four edges as arguments. This one worked.
I am very surprised that people are able to write code that requires actual reasoning ability using modern LLMs.
I once asked it to read a postcard written by my late grandfather in Polish, as I was struggling to decipher it. It incorrectly identified the text as Romanian and kept insisting on that, even after I corrected it: "I understand you are insistent that the language is Polish. However, I have carefully analyzed the text again, and the linguistic evidence confirms it is Romanian. Because the vocabulary and alphabet are not Polish, I cannot read it as such." Eventually, after I continued to insist that it was indeed Polish, it got offended and told me it would not try again, accusing me of attempting to mislead it.
I once had Claude tell me to never talk to it again after it got upset when I kept giving it peer-reviewed papers explaining why it was wrong. I must have hit the Tumblr dataset, since I was told I was sealioning it, which took me aback for a while.
Python is the only way to do real image work these days, and as a bonus LLMs suck a lot less at giving you nearly useful Python code.
The above is a bit of a lie, as OpenCV has more capabilities, but unless you are deep in the weeds of preparing images for neural networks, Pillow is plenty good enough.
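For what it's worth, the fixed-margin crop the parent finally got working is only a few lines of Pillow. A minimal sketch (the script name, argument order and file handling are my own assumptions, not the original script):

# crop_margins.py -- crop a fixed number of pixels from each edge of an image.
# Hypothetical usage: python crop_margins.py in.jpg out.jpg LEFT TOP RIGHT BOTTOM
import sys
from PIL import Image

src, dst = sys.argv[1], sys.argv[2]
left, top, right, bottom = map(int, sys.argv[3:7])

img = Image.open(src)
w, h = img.size
# Pillow's crop box is (left, top, right, bottom) in absolute pixel coordinates.
img.crop((left, top, w - right, h - bottom)).save(dst)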
The most effective analogy I have found is comparing LLMs to theater and film actors. Everyone understands that, and the analogy offers actual predictive power. I elaborated on the idea if you're curious to read more:
Do you know what a "coincidence" actually is? The definition you're using is wrong.
It's not a coincidence that I train a model on healthcare regulations and it answers a question about healthcare regulations correctly.
None of that is coincidental.
If I trained it on healthcare regulations and asked it about recipes, it won't get anything right. How is that coincidental?
If you train a model on only healthcare regulations, it won't answer questions about healthcare regulations; it will produce text that looks like healthcare regulations.
Huh? But it does do that? What do you think training an LLM entails?
Are you of the belief that an LLM trained on non-medical data would have the same statistical chance of answering a medical question correctly?
we're at the "Redefining what words mean in order to not have to admit I was wrong" stage of this argument
That's not what a coincidence is.
A coincidence is: "a remarkable concurrence of events or circumstances without apparent causal connection."
Are you saying that training it on a subset of specific data and it responding with that data "does not have a causal connection"? Do you know how statistical pattern matching works?
It's not coincidence that the answer contains the facts you want. That is a direct consequence of the question you asked and the training corpus.
But the answer containing facts/truth is incidental from the LLM's point of view, in that the machine really does not care about, nor even have any concept of, whether it gave you the facts you asked for or just nice-sounding gibberish. The machine only wants to generate tokens; everything else is incidental. (To the core mechanism, that is. OpenAI and co obviously care a lot about the quality and content of the output.)
They are useful. It's not a coin flip as to whether Bolt will produce a new design of a medical intake form for me if I ask it to. It does. It doesn't randomly give me a design for a social media app, for instance.
Who would do this manually? Concatenate the two lists and sort them. Use "uniq -c" to count the duplicate lines and grep to pull out the lines which occur twice. It would take a few seconds.
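A rough Python equivalent of that pipeline, assuming the two lists are plain text files with one entry per line (the file names are placeholders):

# Count how often each entry appears across both files; entries present in
# both lists show up twice, mirroring sort | uniq -c | grep.
from collections import Counter

with open("tlds.txt") as a, open("html5_elements.txt") as b:
    counts = Counter(line.strip().lower() for line in list(a) + list(b) if line.strip())

print("\n".join(sorted(item for item, n in counts.items() if n == 2)))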
This made me laugh, because it's the exact opposite sentiment of the anti-LLM crowd. So which is it? Is it only useful if you know what you're doing, or less useful if you know what you're doing?
> "I can't wait until I can jack into the Metaverse and buy an NFT with cryptocurrency just by using an LLM! Perhaps I can view it on my 3D TV by streaming it over WIMAX? I'd better stock up on quantum computers to make sure it all works."
In the author's attempt to be a smartass, they showed their true colors. It makes them sound childish. Instead of just admitting they were wrong, they make some flippant remark about cryptocurrency and NFTs, despite those having vastly different purposes, goals, and successes. Just take the L.
to add: "I shouldn't have to know anything about LLMs to use them correctly" is one heck of a take, but ok.
> "I don't. I hate the way this is being sold as a universal and magical tool. The reality doesn't live up to the hype."
And I hate the way in which people will do the opposite: claim it has no use cases. It's literally the same sentiment, but in reverse. It's just as myopic and naive. But for whatever reason, we can look at a CEO hawking it and think "They're just trying to make more money" but can't see the flip side of devs not wanting to lose their livelihoods to something. We have just as much to lose as they have to gain, but want to pretend like we're objective.
This continues a pattern as old as home computing: The author does not understand the task themselves, consequently "holds the computer wrong", and then blames the machine.
No "lists" were being compared. The LLM does not have a "list of TLDs" in its memory that it just refers to when you ask it. If you haven't grokked this very fundamental thing about how these LLMs work, then the problem is really, distinctly, on your end.
Ask a stupid question, get a stupid answer.
Ok, I only have to:
1. Generally solve the problem for the AI
2. Make a step by step plan for the AI to execute
3. Debug the script I get back and check by hand if it uses reliable sources.
4. Run that script.
So what do I need the AI for?
Also, you are literally describing how you are holding it wrong. If you expect the LLM to magically know what you want from it without you yourself having to make the task understandable to the machine, you are standing in front of your dishwasher waiting for it to grow arms and do your dishes in the sink.
How would you solve that problem? You'd probably go to the internet, get the list of TLDs and the list of HTML5 elements, and then compare those lists.
The author compares three commercial large‑language models that have direct internet access, but none of them appear capable of performing this seemingly simple task. I think his conclusion is valid.
That's when I realized this site was making heavy use of AI. Sadly, lots of people are going to trust but not verify...
> A factor of 1966 is a number that divides the number without remainder.
> The factors of 1966 are 1, 2, 3, 6, 11, 17, 22, 33, 34, 51, 66, 102, 187, 374, 589, 1178, 1966.
If I google for the factors of 1966, Google's AI gives the same wrong factors.
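For reference, this is trivial to check in code; 1966 = 2 × 983 and 983 is prime, so the only factors are 1, 2, 983 and 1966 (3, 6, 11, 17 and the rest of that list simply don't divide 1966):

n = 1966
print([d for d in range(1, n + 1) if n % d == 0])  # -> [1, 2, 983, 1966]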
Tried on ChatGPT, seems fine.
It's consistently missing `search` for all of us.
[1] https://shkspr.mobi/blog/2023/09/false-friends-html-elements...
However, I do superficially agree with some of the links at the end. LLMs as they have been so far are confirmation machines and it does take skill to use them effectively. Or knowing when not to use them.
Except this microwave is advertised as also being for steaks. And sometimes it works, and sometimes you cannot even warm milk in it. It's totally not reliable.
> I know this question is possible to answer _because I went through the lists two years ago_.
Is this the right answer? Seems like it. I used the thinking model.
By now, numerous notable programmers have reported positive experiences with all forms of AI-assisted coding, which this conclusion arrogantly fails to account for.
As a ChatGPT user I would have reached for the thinking model for such questions. I understand if the “auto” mode doesn’t pick the right model here, but confident claims from the author should be backed up by at least this much.
Go sit on public transport and look at how people use their devices. They don't fiddle with settings or dive deep into configuration menus.
I literally just opened the tools and used what they gave me. They're sold on the promise that "this thing is really clever and will answer any question!!" so why should I have to spend time futzing with it?
sure, but when I expect this [1] from _any_ full time hire, my "expectations are too high from people" and "everybody has their strengths"
[1] find a list of valid html5 elements, find a list of TLDs, have an understanding of ccTLDs and gTLDs
uint16_t ea_indexed(void)
{
    uint8_t post = *PC++;
    uint16_t base, off = 0;

    /* 1. pick base register */
    static const uint16_t *const base_tbl[4] = { &X, &Y, &U, &S };
    base = *base_tbl[(post >> 5) & 3];

    /* 2. work out the effective address */
    if ((post & 0x80) == 0) {                   /* 5-bit signed offset */
        off = (int8_t)(post << 3) >> 3;
    } else if ((post & 0x60) == 0x20) {         /* 8- or 16-bit offset */
        if (post & 0x10) {                      /* 16-bit */
            off = (int16_t)fetch_be16(PC);
            PC += 2;
        } else {                                /* 8-bit */
            off = (int8_t)*PC++;
        }
    } else if ((post & 0x60) == 0x40) {         /* auto inc/dec */
        int8_t step = ((post & 0x0F) == 0x0) ?  1 :
                      ((post & 0x0F) == 0x1) ?  2 :
                      ((post & 0x0F) == 0x2) ? -1 :
                      ((post & 0x0F) == 0x3) ? -2 : 0;
        if (step > 0) base += step;             /* post-increment */
        off = step < 0 ? step : 0;              /* pre-decrement already applied */
        if (step < 0) base += step;
    } else if ((post & 0x60) == 0x60) {         /* accumulator offset */
        static const uint8_t scale[4] = {1,1,2,1};  /* A,B,D,illegal */
        uint8_t acc = (post >> 3) & 3;
        if (acc == 0)      off = A;
        else if (acc == 1) off = B;
        else if (acc == 2) off = (A<<8)|B;      /* D */
        off *= scale[acc];
    } else {                                    /* 11x111xx is illegal */
        illegal();
    }

    uint16_t ea = base + off;

    /* 3. optional indirect */
    if (post & 0x10) ea = read16(ea);

    return ea;
}
(Full convo: https://text.is/4ZW2J ) From looking at Page 150 of https://colorcomputerarchive.com/repo/Documents/Books/Motoro... it looked pretty much perfect, except for the accumulator addressing. That's impressive...
Then in another chat I asked it to "give a technical description of how the 6809 indexed operands are decoded" and it just can't do it. It always gets the fundamentals wrong and makes pretty much everything up. Try it yourself; it doesn't have to be Kimi, most other AIs get it wrong too.
My assumption is that it's learned how to represent it in code from reading emulator sources, but hasn't quite mapped it well enough to be able to explain it in English... or something like that.
To do a task like this with LLMs, you need to use a document for your source lists or bring them directly into context; then a smart model with good prompting might zero-shot it.
But if you want any confidence in the answer, you need to use tools: “here are two lists; write a python script to find the exact matches, and return a new list with only the exact matches. Write a test dataset and verify that there are no errors, omissions, or duplicates.”
LLMs plus tools / code are amazing. LLMs on their own are a professor with an intermittent heroin problem.
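A minimal sketch of the kind of script that prompt asks for; the file names and the tiny test set below are my own placeholders, not real data:

def exact_matches(list_a, list_b):
    """Return the sorted entries that appear in both lists (case-insensitive, deduplicated)."""
    a = {item.strip().lower() for item in list_a if item.strip()}
    b = {item.strip().lower() for item in list_b if item.strip()}
    return sorted(a & b)

# Tiny hand-made test set: only "menu" and "video" appear in both lists,
# and the duplicate "menu" must not produce a duplicate in the output.
assert exact_matches(["menu", "video", "dev"],
                     ["menu", "video", "table", "menu"]) == ["menu", "video"]

if __name__ == "__main__":
    with open("tlds.txt") as f:             # placeholder input files
        tlds = f.read().splitlines()
    with open("html5_elements.txt") as f:
        elements = f.read().splitlines()
    print("\n".join(exact_matches(tlds, elements)))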
This is my personal favourite example of LLMs being stupid. It's a bit old, but it's very funny that Grok is the only one that gets it.
First there's the tokenization issue, the same old "how many Rs in STRAWBERRY" where they are often confidently wrong, but I also asked them not to mix tenses (-ing and -ed, for example) and that was very hard for them.
jw1224•1h ago
> yoUr'E PRoMPTiNg IT WRoNg!
> Am I though?
Yes. You’re complaining that Gemini “shits the bed”, despite using 2.5 Flash (not Pro), without search or reasoning.
It’s a fact that some models are smarter than others. This is a task that requires reasoning, so the article is hard to take seriously when the author uses a model optimised for speed (not intelligence) and doesn’t even turn reasoning on (nor suggests they’re even aware of it being a feature).
I asked the exact prompt to ChatGPT 5 Thinking and got an excellent answer with cited sources, all of which appear to be accurate.
softwaredoug•1h ago
Search and reasoning use up more context, leading to context rot and subtler, harder-to-detect hallucinations. Reasoning doesn’t always focus on evaluating the quality of evidence, just “problem solving” from some root set of axioms found in search.
I’ve had this happen in Claude Code, for example, where it hallucinated a few details about a library based on some badly written forum post.
edent•1h ago
Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"
Either way, disappointing.
magicalhippo•45m ago
That is indeed an area where LLMs don't shine.
That is, not only are they trained to always respond with an answer, they have no ability to accurately tell how confident they are in that answer. So you can't just filter out low confidence answers.
mathewsanders•33m ago
I’m presuming that one class of junk/low-quality output is when the model doesn’t have high-probability next tokens and works with whatever poor options it has.
Maybe low-probability tokens that cross some threshold could have a visual treatment to give feedback, the same way word processors give feedback on a spelling or grammatical error.
But maybe I’m making a mistake thinking that token probability is related to the accuracy of output?
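Some APIs do expose per-token probabilities, so the idea is at least testable. A rough sketch with the OpenAI Python client (the threshold and the flagging scheme are arbitrary assumptions of mine, and a low-probability token is not the same thing as a factually wrong one):

from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are the factors of 1966?"}],
    logprobs=True,
)

# Flag tokens the model itself considered unlikely, a bit like a squiggly underline.
THRESHOLD = -2.0  # arbitrary cut-off in log-probability space
for tok in resp.choices[0].logprobs.content:
    marker = "  <-- low confidence?" if tok.logprob < THRESHOLD else ""
    print(f"{tok.token!r}: {tok.logprob:.2f}{marker}")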
hobofan•36m ago
> Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"
That's literally what ChatGPT did for me[0], which is consistent with what they shared at the last keynote (a quick, low-reasoning answer by default first, with reasoning/search only if explicitly prompted or as a follow-up). It did miss one match though, as it somehow didn't parse the `<search>` element from the MDN docs.
[0]: https://chatgpt.com/share/68cffb5c-fd14-8005-b175-ab77d1bf58...
delusional•50m ago
I think the author's point stands.
EDIT: I tried it with "Deep Research" too. Here it doesn't invent either TLDs or HTML elements, but the resulting list is incomplete.
[1]: https://en.wikipedia.org/wiki/.bi
dgfitz•27m ago
Isn’t that the whole goddamn rub? You don’t _know_ if they’re accurate.