frontpage.

Show HN: MockinglyAI On-Demand AI Interviewer for System Design Mock Interviews

https://www.mockingly.ai/
1•ayusch_•4m ago•0 comments

Show HN: Tabwise – AI data analyst that outperforms ChatGPT and Claude

https://www.brandbucket.com/names/tabwise
1•abhegd•4m ago•0 comments

Thunderbird Pro August 2025 Update

https://blog.thunderbird.net/2025/08/tbpro-august-2025-update/
3•mnmalst•6m ago•1 comments

Show HN: Ginormous News, daily global news briefings from radio

https://ginormous.news/
2•julianchr•6m ago•0 comments

Powell Highlights Job Market Worries, Opening Path to Rate Cut

https://www.wsj.com/economy/central-banking/powell-highlights-job-market-worries-opening-path-to-rate-cut-4169a7ba
1•pera•8m ago•0 comments

Max, the 'chilling' new development in Putin's 'digital iron curtain'

https://www.abc.net.au/news/2025-08-22/inside-max-the-new-kremlin-controlled-messenger-in-russia/105683386
2•onychomys•9m ago•0 comments

The decline in reading for pleasure over 20 years

https://www.cell.com/iscience/fulltext/S2589-0042(25)01549-4
2•M95D•10m ago•1 comments

Ruby for Good

https://ti.to/codeforgood/rubyforgood
1•mooreds•10m ago•0 comments

Simplex noise demystified (2005) [pdf]

https://cgvr.cs.uni-bremen.de/teaching/cg_literatur/simplexnoise.pdf
1•wy35•12m ago•0 comments

How Deeply Trump Has Cut Federal Health Agencies

https://projects.propublica.org/federal-health-worker-cuts-rfk-trump-administration/
1•speckx•12m ago•0 comments

Heatmap your Apple training data

https://huggingface.co/spaces/filipacsr/heatmap-my-training
1•filipacsr•12m ago•0 comments

Show HN: Lacquer – GitHub Actions for AI workflows in a single Go binary

https://github.com/lacquerai/lacquer
7•hugorut•13m ago•1 comments

Features I Wish MySQL Had but Postgres Has

https://www.bytebase.com/blog/features-i-wish-mysql-had-but-postgres-already-has/
2•ksec•13m ago•0 comments

Stable cortical body maps before and after arm amputation

https://www.nature.com/articles/s41593-025-02037-7
1•bookofjoe•14m ago•0 comments

Train a small GPT to solve mazes

https://github.com/martianlantern/MazeGPT
1•martianlantern•14m ago•0 comments

After the Hack: System Shock and Transhumanism [video]

https://www.youtube.com/watch?v=f-pgcX9yTGM
1•dsego•15m ago•0 comments

US to review all 55M visas to check if holders broke rules

https://www.bbc.com/news/articles/cvg04gm92d3o
3•vinni2•15m ago•1 comments

Using GPT‑Driver Vision AI Cuts Groupon's Mobile Regression from ~20 H to ~10 H

https://www.mobileboost.io/post/using-gpt-driver-vision-ai-cuts-groupon-mobile-regression-20h-to-10h
1•chrtng•18m ago•0 comments

How Konvu got its name

https://konvu.com/blog/how-konvu-got-its-name
1•zetaben•18m ago•0 comments

ChatGPT is pulling from Google Search to answer your questions

https://www.tomsguide.com/ai/chatgpt-is-secretly-using-google-search-data-heres-how
2•thm•19m ago•1 comments

The Aneurothymia Spectrum: Divergent Formation of Neurodevelopment

https://cristinagherghel.substack.com/p/the-aneurothymia-spectrum-divergent
1•Neuropsychology•20m ago•0 comments

Why Real-Time Air Quality Data Might Be Making Things Worse

https://www.airgradient.com/blog/pm2.5-time-resolution-and-concentration/
1•ahaucnx•20m ago•0 comments

Busy Beaver Hunters Reach Numbers That Overwhelm Ordinary Math

https://www.quantamagazine.org/busy-beaver-hunters-reach-numbers-that-overwhelm-ordinary-math-20250822/
1•DocFeind•20m ago•0 comments

Building Generative AI Applications with GitHub Models and .NET Aspire

https://www.milanjovanovic.tech/blog/building-generative-ai-applications-with-github-models-and-dotnet-aspire
1•mooreds•21m ago•0 comments

Build Log: Macintosh Classic

https://www.jeffgeerling.com/blog/2025/build-log-macintosh-classic
1•speckx•21m ago•0 comments

The Strange Case of Duke's Still Living Department of Slavic and Eurasian Studies

https://indyweek.com/news/durham/the-strange-case-of-dukes-still-living-department-of-slavic-and-eurasian-studies/
1•mooreds•21m ago•0 comments

Heroic dog crosses busy street to get owners emergency help in Pittsburgh

https://www.theguardian.com/us-news/2025/aug/04/pittsburgh-hero-dog-emergency-help-owners
1•PaulHoule•23m ago•0 comments

A fix for a divergence issue in a popular SQLite CRDT extension

https://github.com/sqliteai/sqlite-sync/blob/main/docs/PriKey.md
1•marcobambini•23m ago•0 comments

Show HN: Twins of Caduceus – A high dexterity twin-stick snake game

https://mordenstar.com/projects/twins-of-caduceus/
2•vunderba•24m ago•0 comments

Open Source vibe coding platform

1•ainiro•24m ago•0 comments

Being “Confidently Wrong” is holding AI back

https://promptql.io/blog/being-confidently-wrong-is-holding-ai-back
88•tango12•2h ago

Comments

_Algernon_•2h ago
Rolling weighted dice repeatedly to generate words isn't factually accurate. More at 11.
chpatrick•2h ago
It is if the weights are sufficiently advanced.
blueflow•2h ago
I find such statements frightening. Too many people cannot tell the difference between prevalence ("everybody does it") and factual correctness.
chpatrick•1h ago
Nothing to do with dice though.
blueflow•1h ago
The whole "stochastic means to find factual correctness" thing is an error of method; arguing about weights here is nonsense.
chpatrick•1h ago
It isn't, though: the most factually correct human expert is also stochastic. The only question is how the dice are weighted.
blueflow•1h ago
"human expert" as reference for "factually correct", oh just gently caress yourself. Appeal to authority (expert = social status) is as much bullshit as appeal to popularity.
chpatrick•57m ago
Right now the fully deterministic, always-correct oracle machine doesn't exist. The most authoritative answer we can get on a subject is from a respected human in their field (who is still stochastic). It's unrealistic to hold LLMs to a higher standard than that.
blueflow•53m ago
"The runway is free"

- Jacob Veldhuyzen van Zanten, respected aviation expert, Tenerife 1977, brushing off the flight engineer's concern about another aircraft on the runway

chpatrick•50m ago
Ok, so humans are also fallible. Your point being?
Zigurd•2h ago
The weights, so to speak, come from the knowledge base. That means you can't get away from the quality of the knowledge base. That isn't uniform across all domains of knowledge. Then the problem becomes how do you make the training material uniformly high-quality in every knowledge domain? At best it becomes the meta problem of determining the quality of knowledge in some way that makes an LLM able to calibrate confidence to a knowledge domain. But more likely we're stuck with the dubious quality that comes from human bias and wishful thinking in supposedly authoritative material.
chpatrick•1h ago
Sure, it's only as good as the training data. But human experts also output tokens with some statistical distribution. That doesn't mean anything.
Zigurd•1h ago
That sounds plausible. But it doesn't explain why LLMs make laughably bad errors that even a biased and haphazard human researcher wouldn't make.
chpatrick•1h ago
I think that's been a lot less true over the last year or so. Gemini 2.5 Pro is the first LLM I actually find pretty damn reliable.
Zigurd•1h ago
Gemini seems to have a user interface that, for the way most people encounter Gemini, is more closely linked to search results. This leads me to suspect that Google's approach to training could be uniquely informed by both current and historic web crawling.
contagiousflow•1h ago
If you think talking to an LLM is the same experience as talking to a human you should probably talk to more humans
chpatrick•1h ago
That's not what I said. What I said is that the claim "LLMs aren't intelligent because they stochastically produce characters" doesn't hold, because humans do that too even if they're intelligent and authoritative.
krapp•1h ago
We don't actually know how human cognition works, so how do you know that humans "stochastically produce characters?"
chpatrick•54m ago
Do humans always answer exactly the same way to the same question? No.

Also, you could always pick the most likely token in an LLM to make it deterministic if you really wanted.
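
(A minimal sketch of that sampled-vs-greedy difference; the toy vocabulary and logits below are invented purely for illustration:)

  import numpy as np

  rng = np.random.default_rng()
  vocab = ["yes", "no", "maybe"]        # toy vocabulary (assumption)
  logits = np.array([2.0, 1.5, 0.3])    # toy next-token scores (assumption)

  probs = np.exp(logits - logits.max())
  probs /= probs.sum()                  # softmax over the toy vocabulary

  sampled = vocab[rng.choice(len(vocab), p=probs)]   # stochastic: can differ between runs
  greedy = vocab[int(np.argmax(probs))]              # deterministic: always the most likely token
  print("sampled:", sampled, "| greedy:", greedy)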

krapp•52m ago
That doesn't really prove anything. I could create a Markov chain with a random seed that doesn't always answer the same question the same way, but that doesn't prove the human brain works like a Markov chain with a random seed.

One thing humans tend not to do is confabulate entirely to the degree that LLMs do. When humans do so, it's considered a mental illness. Simply saying the same thing in a different way is not the same as randomly generating syntactically correct nonsense. Most humans will not, now and then, answer that 2 + 2 = 5, or that the sun rises in the southeast.

chpatrick•48m ago
I'm not making any claim about how the human brain works. The only thing I'm saying is that humans also produce somewhat randomized output for the same question, which is pretty uncontroversial I think. That doesn't mean they're unintelligent. Same for LLMs.
staticman2•18m ago
I really wish people into LLMs would limit themselves to terms from neuroscience or philosophy when describing humans.

You are in my mind rightfully getting pushback for writing "human experts also output tokens with some statistical distribution. "

chpatrick•4m ago
That's just a mathematical fact.

You have a big opaque box with a slot where you can put text in and you can see text come out. The text that comes out follows some statistical distribution (obviously), and isn't always the same.

Can you decide just from that if there's an LLM or a human sitting inside the box? No. So you can't make conclusions about whether the box as a system is intelligent just because it outputs characters in a stochastic manner according to some distribution.

nijave•1h ago
MCP and agents seem like solutions, but as far as I know maintaining sufficient context is still a problem

I.e. ability to plug in expert data sources

Zigurd•1h ago
Fine-tuning and RAG should, in theory, enable applications of LLMs to perform better in specific knowledge domains, by focusing annotation of knowledge on the domains specific to the application.
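
(A toy illustration of the retrieval half of that idea, using word-overlap cosine similarity in place of a real embedding model; the documents and question are invented:)

  import math
  import re
  from collections import Counter

  docs = [
      "The 2019 audit flagged three invoices from the Berlin office.",
      "Our refund policy allows a return within 30 days of purchase.",
      "Quarterly revenue grew 12 percent year over year in Q2.",
  ]

  def vec(text):
      return Counter(re.findall(r"[a-z0-9]+", text.lower()))

  def cosine(a, b):
      num = sum(a[t] * b[t] for t in set(a) & set(b))
      den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
      return num / den if den else 0.0

  question = "Within how many days can a customer return a purchase?"
  best = max(docs, key=lambda d: cosine(vec(question), vec(d)))

  # The retrieved passage is prepended to the prompt, so the model answers
  # from the domain text rather than from whatever is baked into its weights.
  prompt = f"Context:\n{best}\n\nQuestion: {question}\nAnswer using only the context."
  print(prompt)
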
JamesSwift•1h ago
I think you're missing the point. The issue is not the amount of knowledge it possesses. The problem is that there's no way to go from "statistically generate the next word" to "what is your confidence level in the fact you just stated". Maybe, with an enormous amount of computation, we could layer another AI on top to evaluate or add confidence intervals, but I just don't see how we get there without another quantum leap.
chpatrick•1h ago
Of course there is. If its training forces it to develop a theory of mind then it will weight the dice so that it's more likely to output "I don't know". Most likely the culprit is that it's hard to make training data for things that it doesn't know.
meindnoch•2h ago
Not just AI.
rokkamokka•2h ago
The interesting question here is whether a statistical model like GPTs actually can encode this in a meaningful way. Nobody has quite found it yet, if so
ACCount37•1h ago
They can, and they already do it somewhat. We've found enough to know that.

As the most well known example: Anthropic examined their AIs and found that they have a "name recognition" pathway - i.e. when asked about biographic facts, the AI will respond with "I don't know" if "name recognition" has failed.

This pathway is present even in base models, but only results in a consistent "I don't know" if the AI was trained for reduced hallucinations.

AIs are also capable of recognizing their own uncertainty. If you have an AI-generated list of historic facts that includes hallucinated ones, you can feed that list back to the same AI and ask it how certain it is about every fact listed. Hallucinated entries will consistently have lower certainty (sketched at the end of this comment). This latent "recognize uncertainty" capability can, once again, be used in anti-hallucination training.

Those anti-hallucination capabilities are fragile, easy to damage in training, and do not fully generalize.

Can't help but think that limited "self-awareness" - and I mean that in a very mechanical, no-nonsense "has information about its own capabilities" way - is a major cause of hallucinations. An AI has some awareness of its own capabilities and how certain it is about things - but not nearly enough of it to avoid hallucinations consistently across different domains and settings.
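
(A rough sketch of that second-pass certainty check; ask_model below is a stand-in, not a real API, and the sample facts are invented:)

  def ask_model(prompt: str) -> str:
      # Stand-in for a real chat-completion call (assumption); swap in your client of choice.
      return "50"

  facts = [
      "The Eiffel Tower was completed in 1889.",
      "Napoleon Bonaparte was born in 1771.",   # deliberately wrong (1769), to mimic a hallucinated entry
  ]

  for fact in facts:
      # Second pass: same model, fresh prompt, asked only to rate its certainty about each entry.
      reply = ask_model(
          "On a scale of 0-100, how certain are you that this statement is true? "
          "Answer with a number only.\n\n" + fact
      )
      print(fact, "->", reply.strip())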

rwmj•2h ago
Only thing? Just off the top of my head: That the LLM doesn't learn incrementally from previous encounters. That we appear to have run out of training data. That we seem to have hit a scaling wall (reflected in the performance of GPT5).

I predict we'll get a few research breakthroughs in the next few years that will make articles like this seem ridiculous.

therobots927•1h ago
Apple released this recently: https://machinelearning.apple.com/research/illusion-of-think...
tango12•1h ago
Author here.

You’re right in that it’s obviously not the only problem.

But without solving this, it seems like no matter how good the models get, it'll never be enough.

Or, yes, the biggest research breakthrough we need is reliable calibrated confidence. And that’ll allow existing models as they are to become spectacularly more useful.
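
(For what "calibrated" means concretely, here is a toy expected-calibration-error check; the confidences and correctness flags below are made up:)

  import numpy as np

  # Made-up evaluation data: each answer's stated confidence and whether it was right.
  confidence = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.30, 0.20])
  correct    = np.array([1,    1,    0,    1,    1,    0,    0,    0])

  edges = np.linspace(0.0, 1.0, 5)            # four confidence buckets
  ece = 0.0
  for lo, hi in zip(edges[:-1], edges[1:]):
      in_bucket = (confidence > lo) & (confidence <= hi)
      if in_bucket.any():
          gap = abs(correct[in_bucket].mean() - confidence[in_bucket].mean())
          ece += in_bucket.mean() * gap       # weight each bucket by its share of answers
  print(f"expected calibration error: {ece:.3f}")   # 0.0 would be perfect calibration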

EdNutting•1h ago
The biggest breakthrough that we need is something resembling actual intelligence in AI (human or machine, I’ll let you decide where we need it more ;) )
binarymax•1h ago
You might be getting downvoted because you editorialized your own title. If it’s obviously not the only thing then don’t add that to the title :)
firesteelrain•1h ago
> That we appear to have run out of training data

And now, in some cases for a while, it is training on its own slop.

impossiblefork•1h ago
Having run out of training data isn't something holding back LLMs in this sense.

But I agree that being confidently wrong is not the only thing they can't do. Programming: great. Maths: apparently great nowadays, since Google and OpenAI have something that could solve most problems on the IMO, even if the models we get to see probably aren't the models that can do this. But LLMs produce crazy output when asked to produce stories, they produce crazy output when given overly long, confusing contexts, and they have some other problems of that sort.

I think much of it is solvable. I certainly have ideas about how it can be done.

harsh3195•1h ago
In terms of adoption, I think the user is right. That is the only thing stopping adoption of existing models in the real world.
moduspol•1h ago
Unclear limits on how much context can be reliably provided and effectively used without degrading the result.
lvl155•1h ago
An incrementally learning model is pretty hard. That's actually something I am working on right now and it's completely different from developing/implementing LLMs.
criddell•1h ago
I think that's what it's going to take. Eventually put the learning model in a robot body and send it out into the real world where there's no shortage of training data.
tliltocatl•1h ago
Cool, got any previous work to share?
lazide•1h ago
The article is the peak of confidently wrong itself, for solid irony points.
ninetyninenine•1h ago
It does. We keep a section of the context window for memory. The LLM however is the one deciding what is remembered. Technically via the system prompt we can have it remember every prompt if needed.

But memory is a minor thing. Talking to a knowledgeable librarian or professor you never met is the level we essentially need to get it to for this stuff to take off.

mettamage•1h ago
> Only thing? Just off the top of my head: That the LLM doesn't learn incrementally from previous encounters. That we appear to have run out of training data.

Ha, that almost seems like an oxymoron. The previous encounters can be the new training data!

j-krieger•1h ago
Queries are questions, in the sense that they are not the original facts. I don't think they are useful for training data.
traceroute66•1h ago
> That we appear to have run out of training data.

I think the next iteration of LLMs is going to be "interesting", now that all the websites they used to freely scrape are increasingly putting up walls.

j-krieger•1h ago
Never before did we have a combination of well and poison where polluting the first was both as instantaneous and as easily achieved.

I've yet to see a convincing article arguing for artificial training data.

tliltocatl•1h ago
> LLM doesn't learn incrementally from previous encounters

This. Lack of any way to incorporate previous experience seems like the main problem. Humans are often confidently wrong as well, and avoiding being confidently wrong is actually something one must learn rather than an innate capability. But humans wouldn't repeat the same mistake indefinitely.

ACCount37•11m ago
You can gather feedback from inference and funnel that back into model training. It's just very, very hard to do that without shooting yourself in the foot.

The feedback you get is incredibly entangled, and disentangling it to get at the signals that would be beneficial for training is nowhere near a solved task.

Even OpenAI has managed to fuck up there - by accidentally training 4o to be a fully bootlickmaxxed synthetic sycophant. Then they struggled to fix that for a while, and only made good progress at that with GPT-5.

FergusArgyll•1h ago
The problem is the kinds of "data" users will feed it. It's basically an impossible task to put a continuously learning model online and not have it devolve into the optimal mix of Stalin & Hitler
energy123•29m ago
Re online learning - If I freeze 40-year-old Einstein and make it so he can't form new memories beyond 5 minutes, that's still an incredibly useful, generally intelligent thing. Doesn't seem like a problem that needs to be solved on the critical path to AGI.

Re training data - We have synthetic data, and we probably haven't hit a wall. GPT-5 came only 3.5 months after o3. People are reading too much into the tea leaves here. We don't have visibility into the cost of GPT-5 relative to o3. If it's 20% cheaper, that's the opposite of a wall; that's exponential-like improvement. We don't have visibility into the IMO/IOI medal-winning models. All I see are people curve-fitting onto very limited information.

lucideer•2h ago
While the thrust of this article is generally correct, I have two issues with it:

1. The words "the only thing" massively underplays the difficulty of this problem. It's not a small thing.

2. One of the issues I've seen with a lot of chat LLMs is their willingness to correct themselves when asked - this might seem, on the surface, to be a positive (allowing a user to steer the AI toward a more accurate or appropriate solution), but in reality it simply plays into users' biases & makes it more likely that the user will accept & approve of incorrect responses from the AI. Often, rather than the AI "correcting" itself, the exchange merely "teaches" the AI how to be confidently wrong in an amenable & subtle manner which the individual user finds easy to accept (or more difficult to spot).

If anything, unless/until we can solve the (insurmountable) problem of AI being wrong, AI should at least be trained to be confidently & stubbornly wrong (or right). This would also likely lead to better consistency in testing.

traceroute66•1h ago
> is their willingness to correct themselves when asked

Except they don't correct themselves when asked.

I'm sure we've all been there, many, many, many, many, many times ....

   - User: "This is wrong because X"
   - AI: "You're absolutely right !  Here's a production-ready fixed answer"
   - User: "No, that's wrong because Y"
   - AI: "I apologise for frustrating you ! Here's a robust answer that works"
   - User: "You idiot, you just put X back in there"
   - and so continues the vicious circle....
stetrain•1h ago
Yep, the LLM will happily continue this spiral indefinitely but I've learned that if providing a bit more context and one correction doesn't provide a good solution, continuing is generally a waste of time.

They tend to very quickly lose useful context of the original problem and stated goals.

nyeah•1h ago
Yes, that is the point of the comment.
stetrain•14m ago
Yes, you’re absolutely right! Agreeing with the comment and adding my own experience was the point of my comment.

Is there anything else I can help you with?

therobots927•1h ago
Yeah I think our jobs are safe. Why doesn’t anyone acknowledge loops like this? They happen all the time and I’m only using it once a week at the most
traceroute66•1h ago
The AI-fanbois will quickly tell you that you are misusing the context or your prompt is "wrong".

But I've had it consistently happen to me on tiny contexts (e.g. I've had to spend time trying - and failing - to get it to fix a mess it was making with a straightforward 200-ish line bash script).

And it's also very frequently happened to me when I've been very careful with my prompts (e.g. explicitly telling it to use a specific version of a specific library ... and it goes and ignores me completely and picks some random library).

gavinray•1h ago
I'd be curious if you could share some poor-performing prompts.

I would be willing to record myself using them across paid models with custom instructions and see if the output is still garbage.

gavinray•1h ago

  > Yeah I think our jobs are safe.
I give myself 6-18 months before I think top-performing LLMs can do 80% of the day-to-day issues I'm assigned.

  > Why doesn’t anyone acknowledge loops like this?
This is something you run into early on using LLMs and learn to sidestep. This looping is a sort of "context-rot" -- the agent has the problem statement as part of its input, and then a series of incorrect solutions.

Now what you've got is a junk-soup where the original problem is buried somewhere in the pile.

Best approach I've found is to start a fresh conversation with the original problem statement and any improvements/negative reinforcements you've gotten out of the LLM tacked on (sketched below).

I typically have ChatGPT 5 Thinking, Claude 4.1 Opus, Grok 4, and Gemini 2.5 Pro all churning on the same question at once and then copy-pasting relevant improvements across each.
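
(The "fresh conversation" restart mentioned above, reduced to a sketch; the message shape mimics the common chat-completion format but isn't tied to any particular provider:)

  original_problem = "Write a shell script that rotates logs older than 7 days."
  lessons_learned = [
      "Do not parse 'ls' output; use 'find ... -mtime +7'.",
      "Keep it POSIX sh compatible, no bashisms.",
  ]

  # Instead of stacking correction after correction onto a long, rotting context,
  # rebuild the conversation from scratch: the problem plus the distilled constraints only.
  fresh_messages = [
      {"role": "system", "content": "You are a careful shell-scripting assistant."},
      {"role": "user",
       "content": original_problem + "\n\nConstraints:\n- " + "\n- ".join(lessons_learned)},
  ]
  print(fresh_messages[1]["content"])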

fuzzzerd•1h ago
> This looping is a sort of "context-rot" -- the agent has the problem statement as part of it's input, and then a series of incorrect solutions.

While I agree, and also use your workaround, I think it stands to reason this shouldn't be a problem. The context had the original problem statement along with several examples of what not to do, and yet it keeps repeating those very things instead of coming up with a different solution. No human would keep trying one of the solutions included in the context that are marked as not valid.

traceroute66•48m ago
> No human would keep trying one of the solutions included in the context that are marked as not valid.

Exactly. And certainly not a genius human with the memory of an elephant and a PhD in Physics .... which is what we're constantly told LLMs are. ;-)

gavinray•39m ago

  > No human would keep trying one of the solutions included in the context that are marked as not valid.
Yeah, definitely not. Thankfully for my employment status, we're not at "human" levels QUITE yet
dinfinity•35m ago
I concur. Something to keep in mind is that it is often more robust to pull an LLM towards the right place than to push it away from the wrong place (or more specifically, the active parts of its latent space). Sidenote: also kind of true for humans.

That means that positively worded instructions ("do x") work better than negative ones ("don't do y"). The more concepts that you don't want it to use / consider show up in the context, the more they do still tend to pull the response towards them even with explicit negation/'avoid' instructions.

I think this is why clearing all the crap from the context save for perhaps a summarizing negative instruction does help a lot.

gavinray•20m ago

  >  positively worded instructions ("do x") work better than negative ones ("don't do y")
I've noticed this.

I saw someone on Twitter put it eloquently: something about how, just like little kids, the moment you say "DON'T DO XYZ" all they can think about is "XYZ..."

RyanOD•1h ago
But still under pressure in the short-term, no? As companies lean into AI as a means of efficiency / competitive advantage / cost savings, jobs will be eliminated / reduced while companies find their direction. The potential gains are said to be too big to sit on the sidelines and wait to be a late-adopter.
therobots927•1h ago
Yes hold onto your job like your life depends on it because after this bubble pops the job market will get even worse. Then you need to hold on through the trough until experienced engineers are valued again once all of the AI waste flushes out of the system
jansper39•1h ago
Honestly when I speak about these sorts of issues I get the feeling that other people view me as some kind of luddite, especially people above me who presumably want to replace as many people with AI as possible. I suppose me pointing out the flaws breaks the illusion of magic that people want AI to have.
Wowfunhappy•1h ago
...I don't know why, but I swear to god, when Claude gets into one of these cycles I can often get it out by dropping the f-bomb, with maybe a 50% success rate. Something about that word lets it know that it needs to break the pattern.
ACCount37•1h ago
1-turn instruction following and multi-turn instruction following are not the same exact capability, and some AIs only "get good" at the former. 1-turn gets more training attention - because it's more noticeable, in casual use and benchmarks both, and also easier to train for.

With weak multi-turn instruction following, context data will often dominate over user instructions. Resulting in very "loopy" AI - and more sessions that are easier to restart from scratch than to "fix".

Gemini is notorious for underperforming at this, while Claude has relatively good performance. I expect that many models from lesser known providers would also have a multi-turn instruction following gap.

lucideer•1h ago
True. This also often happens.

Probably the ideal would be to have a UI / non-chat-based mechanism for discarding select context.

stetrain•1h ago
Yes, the quickness to correct itself isn't really useful. I would not like a human assistant/intern/pair programmer who, when asked how to do X, said:

> To accomplish X you can just use Y!

But Y isn't applicable in this scenario.

> Oh, you're absolutely right! Instead of Y you can do Z.

Are you sure? I don't think Z accomplishes X.

> On second thought you're absolutely correct. Y or Z will clearly not accomplish X, but let's try Q....

sfn42•1h ago
Being confidently wrong isn't even the problem. It's a symptom of the much deeper problem that these things aren't AI at all, they're just autocomplete bots good enough to kind of seem like AI. There's no actual intelligence. That's the problem.
ninetyninenine•1h ago
No. The experts in the field are past this argument. People have moved on. It is clear to everyone who builds LLMs that the AI is intelligent. The algorithm was autocomplete, but we are finding that an autocomplete bot is basically autocompleting things with humanity changing intelligent content. Your opinion is a minority now and not shared by people on the forefront of building these things. You're holding onto the initial fever-pitched alarmist reaction people had to LLMs when they first came out.

Like you realize humans hallucinate too right? And that there are humans that have a disease that makes them hallucinate constantly.

Hallucinations don’t preclude humans from being “intelligent”. It also doesn’t preclude the LLM from being intelligent.

eCa•1h ago
> Like you realize humans hallucinate too right?

A developer that hallucinates at work to the extent that LLMs do would probably have a lot of issues getting their PRs past code review.

dns_snek•52m ago
> Your opinion is a minority now and not shared by people on the forefront of building these things.

Minority != wrong, with many historic examples that imploded in spectacular fashion. People at the forefront of building these things aren't immune from grandiose beliefs, many of them are practically predisposed to them. They also have a vested interest in perpetuating the hype to secure their generational wealth.

CodexArcanum•45m ago
LLMs don't "hallucinate"; they generate a stochastic sequence of plausible tokens that, in context, when read by a human, are a false statement or nonsensical.

They also don't have an internal world model. Well, I don't think so, but the debate is far from settled. "Experts" like the cofounders of various AI companies (whose livelihood depends on selling these things) seem to believe that. Others do not.

https://aiguide.substack.com/p/llms-and-world-models-part-1

https://yosefk.com/blog/llms-arent-world-models.html

kilpikaarna•21m ago
> It is clear to everyone who builds LLMs that the AI is intelligent.

So presumably we have a solid, generally-agreed-upon definition on intelligence now?

> autocompleting things with humanity changing intelligent content.

What does this even mean?

indigoabstract•10m ago
I think what matters most is that we now know it's possible: a computer mimicking most (though not all) of the abilities we have long considered intelligent is clearly achievable in some indeterminate future.

It's not obvious how long until that point or what form it will finally take, but it should be obvious that it's going to happen at some point.

My speculation is that until AI starts having senses like sight, hearing, touch and the ability to learn from experience, it will always be just a tool/helper/aide to someone doing a job, but could not possibly replace that person in that job, as it lacks the essential feedback mechanisms for successfully doing that job in the first place.

decentrality•1h ago
Agreed with #1 ( came here to say that also )

Pronoun and noun wordplay aside ( 'Their' ... `themselves` ) I also agree that LLMs can correct the path being taken, regenerate better, etc...

But the idea that 'AI' needs to be _stubbornly_ wrong ( more human in the worst way ) is a bad idea. There is a fundamental showing, and it is being missed.

What is the context reality? Where is this prompt/response taking place? Almost guaranteed to be going on in a context which is itself violated or broken; such as with `Open Web UI` in a conservative example: Who even cares if we get the responses right? Now we have 'right' responses in a cul-de-sac universe. This might be worthwhile using `Ollama` in `Zed` for example, but for what purpose? An agentic process that is going to be audited anyway, because we always need to understand the code? And if we are talking about decision-making processes in a corporate system strategy... now we are fully down the rabbit hole. The corporate context itself is coming or going on whether it is right/wrong, good/evil, etc... as the entire point of what is going on there. The entire world is already beating that corporation to death or not, or it is beating the world to death or not... so the 'AI' aspect is more of an accelerant of an underlying dynamic, and if we stand back... what corporation is not already stubbornly wrong, on average?

taco_emoji•1h ago
> Pronoun and noun wordplay aside ( 'Their' ... `themselves` )

How is that wordplay? Those are the correct pronouns.

energy123•1h ago
Mechanistic interpretability could play a role here. The sycophancy you describe in chat mode could arise when the question is "too difficult" and the AI defaults to easy circuits that rely on simple rules of thumb (like whether the context contains positive words such as "excellent"). The user experiences this as the AI just following basic nudges.

Could real-time observability into the network's internals somehow feed back into the model to reduce these hallucination-inducing shortcuts? Like train the system to detect when a shortcut is being used, then do something about it?

stingraycharles•1h ago
> 1. The words "the only thing" massively underplays the difficulty of this problem. It's not a small thing.

Exactly. One could argue that this is just an artifact from the fundamental technique being used: it’s a really fancy autocomplete based on a huge context window.

People still think there’s actual intelligence in there, while the actual work of making these systems appear intelligent is mostly algorithms and software managing exactly what goes into these context windows, and where.

Don’t get me wrong: it feels like magic. But I would argue that the only way to recognize a model being “confidently wrong” is to let another model, trained on completely different datasets with different techniques, judge it. And then preferably multiple.

(This is actually a feature of an MCP tool I use, “consensus” from zen-mcp-server, which enables you to query multiple different models to reach a consensus on a certain problem / solution).
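
(The consensus idea reduced to a sketch; ask is a placeholder with canned answers, not the actual zen-mcp-server interface:)

  from collections import Counter

  def ask(model: str, question: str) -> str:
      # Placeholder (assumption): replace with the real client call for each provider.
      canned = {"model-a": "42", "model-b": "42", "model-c": "41"}
      return canned[model]

  question = "What is 6 * 7?"
  answers = [ask(m, question) for m in ("model-a", "model-b", "model-c")]
  top_answer, votes = Counter(answers).most_common(1)[0]

  # Disagreement between independently trained models is a cheap "confidently wrong" signal.
  if votes == len(answers):
      print("consensus:", top_answer)
  else:
      print(f"only {votes}/{len(answers)} agree - treat '{top_answer}' as low confidence")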

ninetyninenine•1h ago
It’s not massively underplaying it imo. AI hype is real. This is revolutionary technology that humanity has never seen before.

But it happened at a time when hype can be delivered at a magnitude and volume never before seen, beyond any standard set previously by hype machines created by humanity. Not even landing on the moon inundated people with as much hype. But inevitably, like landing on the moon, humanity is suffering from hype fatigue.

Too much hype makes us numb to the reality of how insane the technology is.

Like when someone says the only thing stopping LLMs is hallucinations… that is literally the last gap. LLMs cover creativity, comprehension, analysis, knowledge and much more. Hallucinations are it. The final problem is targeted and boxed into something much narrower than building a human-level AI from scratch.

Don’t get me wrong. Hallucinations are hard. But this being the last thing left is not an underplay. Yes, it’s a massive issue, but it is also a massive achievement to reduce all of AGI to simply solving a hallucination problem.

taco_emoji•1h ago
Oh, buddy, LLM hallucinations are not the only gap left for AGI
indigoabstract•27m ago
I think you would have really enjoyed living in the '50s, when the future was bright and colonizing Mars was basically a solved problem.

What we got instead is a bunch of wisecracking programmers who like to remind everyone of the 90–90 rule, or the last 10 percent.

dns_snek•1h ago
> but in reality it simply plays into users' biases & makes it more likely that the user will accept & approve of incorrect responses from the AI.

Yes! I often find myself overthinking my phrasing to the nth degree because I've learned that even a sprinkle of bias can often make the LLM run in that direction even if it's not the correct answer.

It often feels a bit like interacting with a deeply unstable and insecure people-pleasing person. I can't say anything that could possibly be interpreted as a disagreement because they'll immediately flip the script, and I can't mention that I like pizza before asking them what their favorite food is because they'll just mirror me.

KoolKat23•16m ago
Gemini 2.5 Pro is quite good at being stubborn (well, at least the initial release versions; I haven't tested since).
roxolotl•2h ago
The big thing here is that they can’t even be confident. There is no there there. They are an admittedly very useful statistical model. Ascribing confidence to it is an anthropomorphizing mistake which is easy to make, since we’re wired to trust text that feels human.

They are at their most useful when it is cheaper to verify their output than it is to generate it yourself. That’s why code is rather ok; you can run it. But once validation becomes more expensive than doing it yourself, be it code or otherwise, their usefulness drops off significantly.

projektfu•1h ago
The article buries the lede by waiting until the very end to talk about solutions like having the LLM write DSL code. Presumably if you feed an LLM your orders table and a question about it, you'll get an answer that you can't trust. But if you ask it to write some SQL or similar thing based on your database to get the answer and run it, you can have more confidence.
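
(Roughly what that flow looks like; the table is a toy and the "generated" SQL is hard-coded where the model call would go:)

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
  conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 120.0, "paid"), (2, 80.0, "refunded"), (3, 45.5, "paid")])

  # In the real flow this string would come back from the LLM, given the schema
  # and the question "what is total paid revenue?" (hard-coded here as a stand-in).
  generated_sql = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"

  # The answer now comes from the database, not the model, and the SQL
  # itself can be read and reviewed before it runs.
  (total,) = conn.execute(generated_sql).fetchone()
  print("total paid revenue:", total)   # 165.5
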
z3c0•1h ago
Agreed. All these attempts to benchmark LLM performance based on the interpreted validity of the outputs are completely misguided. It may be the semantics of "context" causing people to anthropomorphize the models (besides the lifelike outputs). Establishing context for humans is the process of holding external stimuli against an internal model of reality. Context for an LLM is literally just "the last n tokens". In that case, the performance would be how valid the most probable token was with the prior n tokens being present, which really has nothing to do with the perceived correctness of the output.
dankobgd•2h ago
i am pretty sure it has many more problems
darth_avocado•1h ago
Funnily the same thing would get you promoted in corporate America as a human
jqpabc123•1h ago
But only if you are physically attractive and skilled at golf.
NoGravitas•1h ago
The thing holding AI back is that LLMs are not world models, and do not have world models. Being confidently wrong is just a side effect of that. You need a model of the world to be uncertain about. Without one, you have no way to estimate whether your next predicted sentence is true, false, or uncertain; one predicted sentence is as good as another as long as it resembles the training data.
mojuba•59m ago
In other words, just like with autonomous driving, you need real world experience aka general intelligence to be truly useful. Having a model of the world and knowing your place in it is one of the critical parts of intelligence that both autonomous vehicle systems and LLM's are missing.
squigz•1h ago
I've said from the beginning that until an LLM can determine and respond with "I do not know that", their usefulness will be limited and they cannot be trusted.
ColinEberhardt•1h ago
I agree with the overall sentiment here, having written something similar recently:

“LLMs don’t know what they don’t know” https://blog.scottlogic.com/2025/03/06/llms-dont-know-what-t...

But I wouldn’t say it is the only problem with this technology! Rather, it is a subtle issue that most users don’t understand

paul7986•1h ago
And it's being overhyped, with doom and gloom about its effects on society.

ChatGPT (5) is not there, especially in replacing my field and skills: graphic design, web design and web development. For the first two it spits out solid creations per your prompt request, yet it cannot edit its creations, it just creates new ones lol. So it's just another tool in my arsenal, not a replacement for me.

Makes me wonder how it generates the logos and website designs ... is it all just hocus pocus, the Wizard of Oz?

nijave•1h ago
I don't know much about it, but apparently we've been having success at work with Figma MCP hooked up to Claude in Cursor. Apparently it can pull from our component library and generate usable code (although it still needs engineering to productionize)

I don't know about replacing anyone but our UI/UX designers are claiming it's significantly faster than traditional mock ups

paul7986•28m ago
Well, until these LLMs are able to spit out initial creations their user likes and then edit them properly per each request entered into the text prompt, our jobs are safe! Even better if you are also a UX Researcher along with a Designer and Developer. Research requires human interaction, and AI can't touch that at present; that's a decade or more away.
blibble•1h ago
the only thing holding me back from being a billionaire is my lack of a billion dollars
dgfitz•1h ago
s/confidently//

Because “ai” is fallible, right now it is at best a very powerful search engine that can also muck around in (mostly JavaScript) codebases. It also makes mistakes in code, adds cruft, and gives incorrect responses to “research-type” questions. It can usually point you in the right direction, which is cool, but Google was able to do that before its enshittification.

s/AI/LLMs

The part where people call it AI is one of the greatest marketing tricks of the 2020s.

tangotaylor•1h ago
I don't think humans are good at assessing the accuracy of their own opinions either and I'm not sure how AI is going to do it. Usually what corrects us is failure: some external stimulus that is indifferent or hostile to us.

As Mazer Rackham from Ender's Game said: "Only the enemy shows you where you are weak."

nijave•1h ago
Maybe AI isn't artificial enough here...
mtkd•1h ago
The link is a sales pitch for some tech that uses MCPs ... see the platform overview on the product top menu

Because MCPs solve the exact issue the whole post is about

rar00•1h ago
I know people are pushing back, taking "only" literally, but from a reasonable perspective what causes LLMs (technically their outputs) to give that impression is indeed the crux of what holds progress back: how/what LLMs learn from data. In my personal opinion, there's something fundamentally flawed that the whole field has yet to properly pinpoint and fix.
jqpabc123•1h ago
there's something fundamentally flawed that the whole field has yet to properly pinpoint and fix.

Isn't it obvious?

It's all built around probability and statistics.

This is not how you reach definitive answers. Maybe the results make sense and maybe they're just nice sounding BS. You guess which one is the case.

The real catch --- if you know enough to spot the BS, you probably didn't need to ask the question in the first place.

jqpabc123•1h ago
Being able to recall all the data from the internet doesn't make you "intelligent".

It makes you a walking database --- an example of savant syndrome.

Combine this with failure on simple logical and cognitive tests and the diagnosis would be --- idiot savant.

This is the best available diagnosis of an LLM. It excels at recall and text generation but fails in many (if not most) other cognitive areas.

But that's ok, let's use it to replace our human workers and see what happens. Only an idiot would expect this to go well.

https://nypost.com/2024/06/17/business/mcdonalds-to-end-ai-d...

myahio•1h ago
Yep, this is why I'm skeptical about using LLMs as a learning tool
JCM9•1h ago
Added to being confidently wrong is the super annoying way it corrects itself after disastrously screwing something up.

AI: “I’ve deployed the API data into your app, following best practices and efficient code.”

Me: “Nope thats totally wrong and in fact you just wrote the API credential into my code, in plaintext, into the JavaScript which basically guarantees that we’re gonna get hacked.”

AI: “You’re absolutely right. Putting API credentials into the source code for the page is not a best practice, let me fix that for you.”

jqpabc123•1h ago
AI Apologetics: "It's all your fault for not being specific enough."
CloseChoice•1h ago
LLMs are largely used by developers, who (in some sense or other) constantly supervise what the LLM does (even if that means, for some, committing to main and running in production). We already have a lot of tools: tests, compilation, a programming language with its harsh restrictions compared to natural language, and of course the eye test. This is not the case for a lot of jobs where GenAI is used for hyperautomation, so I am really curious in which ways it will or won't get adopted in other areas.
nyeah•1h ago
PG pointed this out a while back. He said that AIs were great at generating typical online comments. (NB I don't know which site's comments he might have been referring to.)
merelysounds•1h ago
I’m especially surprised by how little progress has been made. Today’s hallucinations, while less frequent, continue to have a major negative impact. And the problem has been noticed since the start.

> "I will admit, to my slight embarrassment … when we made ChatGPT, I didn't know if it was any good," said Sutskever.

> "When you asked it a factual question, it gave you a wrong answer. I thought it was going to be so unimpressive that people would say, 'Why are you doing this? This is so boring!'" he added.

https://www.businessinsider.com/chatgpt-was-inaccurate-borin...

SalariedSlave•1h ago
Anybody remember active learning? I'm old, and ML was much different back then, but this reminds me of grueling annotation work I had to do.

On a different note: is it just me or are some parts of this article oddly written? The sentence structure and phrasing read as confusing - which I find ironic, given the context.

giancarlostoro•1h ago
What's really funny to me is that sometimes it fixes itself if you just ask "are you SURE ABOUT THIS ANSWER?" Myself and others often wonder why the heck they don't run a 2nd model to "proofread" the output or spot-check it. Like, did you actually answer the question or are you going off on a really weird tangent?

I asked Perplexity some question for sample UI code for Rust / Slint and it gave me a beautiful web UI; I think it got confused because I wanted to make a UI for an API that has its own web UI. I told it that it did NOT give me code for Slint, even though some of its output made references to "ui.slint" and other Rust files; it realized its mistake and gave me exactly what I wanted to see.

tl;dr why don't LLMs just vet themselves with a new context window to see if they actually answered the question? The "reasoning" models don't always reason.

lenerdenator•59m ago
Works fine for humans; I guess we'll know that AI has truly reached human levels of intelligence when being confidently wrong stops holding it back.
esafak•44m ago
Bayesian models solve this problem but they occupy model capacity which practitioners have traditionally preferred to devote to improving the point estimate.
ChrisMarshallNY•15m ago
My favorite is "Tested and Verified," then giving me code that won't even compile.
corytheboyd•12m ago
Isn’t it obvious that the confidently-wrong problem will never go away, because all of this is effectively built on a statistical next-token matcher? Yeah, sure, you can throw on hacks like RAG and a bigger context window, but it’s still built on the same foundation.

It’s like saying you built a 3D scene on a 2D plane. You can employ clever tricks to make 2D look 3D at the right angle, but it’s fundamentally not 3D, which obviously shows when you take the 2D thing and turn it.

It seems like the effectiveness plateau of these hacks will soon be (has been?) reached and the smoke and mirrors snake oil sales booths cluttering Main Street will start to go away. Still a useful piece of tech, just, not for every-fucking-thing.

yifanl•1m ago
There are people convinced that if we throw a sufficient amount of training data and VC money at more hardware, we'll overcome the gap.

Technically, I can't prove that they're wrong, novel solutions sometimes happen, and I guess the calculus is that it's likely enough to justify a trillion dollars down the hole.

kemcho•10m ago
The angle that being able to detect "confidently wrong" could then help kick off new learning is interesting.

Has anyone had any success with continuous learning type AI products? Seems like there’s a lot of hype around RL to specialise.