I built an open-source library to enforce these logic/safety rules outside the model loop: https://github.com/imtt-dev/steer
Unlike a student, an LLM never arrives at a sort of epistemic coherence, where it knows what it knows, how it knows it, and how true it's likely to be. So you have to structure every problem into a format where the response can be evaluated against an external source of truth.
It's like writing a script with the attitude of "yeah, I'm good at this, I don't need to actually run it to know it works" - well, it likely won't work, maybe because of a trivial mistake.
Technically not, we just don't have it high enough
You're doing exactly what you said you wouldn't though. Betting that network requests are more reliable than an LLM: fixing probability with more probability.
Not saying anything about the code - I didn't look at it - but just wanted to highlight the hypocritical statements which could be fixed.
An LLM provides inputs to your system like any human would, so you have to validate them. Something like Pydantic or Django forms is good for this.
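For example, a minimal sketch of that idea, assuming Pydantic v2 and a made-up order schema (the field names are mine, not from any real system):

    # Treat LLM output as untrusted user input and validate it before use.
    from pydantic import BaseModel, Field, ValidationError

    class Order(BaseModel):
        sku: str = Field(min_length=1)
        quantity: int = Field(gt=0)
        price: float = Field(gt=0)

    llm_output = '{"sku": "A-123", "quantity": 2, "price": 9.99}'  # hypothetical model response

    try:
        order = Order.model_validate_json(llm_output)
    except ValidationError as err:
        print(err)  # reject or retry, exactly as you would with a malformed web form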
Anyway, I've written a library in the past (way way before LLMs) that is very similar. It validates stuff and outputs translatable text saying what went wrong.
Someone ported the whole thing (core, DSL and validators) to python a while ago:
https://github.com/gurkin33/respect_validation/
Maybe you can use it. It seems it would save you time by not having to write so many verifiers: just use existing validators.
I would use this sort of thing very differently though (as a component in data synthesis).
> The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.
Must be satire, right?
When Steer catches a failure (like an agent wrapping JSON in Markdown), it doesn’t just crash.
Say you are using AI slop without saying you are using AI slop.
> It's not X, it's Y.
Your investment is justified! I promise! There's no way you've made a devastating financial mistake!
I’ve found after about 3 prompts to edit an image with Gemini, it will respond randomly with an entirely new image. Another quirk is it will respond “here’s the image with those edits” with no edits made. It’s like a toaster that will catch on fire every eighth or ninth time.
I am not sure how to mitigate this behavior. I think maybe an LLM-as-a-judge step with vision could evaluate the output before passing it on to the poor user.
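Something like the sketch below, maybe. It assumes the OpenAI Python SDK and a vision-capable model; the model name, prompt, and function name are my own guesses, not a tested recipe.

    from openai import OpenAI

    client = OpenAI()

    def edit_was_applied(instruction: str, image_url: str) -> bool:
        """Ask a vision model to judge whether the requested edit is visible."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any vision-capable model would do
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Was this edit applied to the image: '{instruction}'? Answer YES or NO."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        answer = response.choices[0].message.content.strip().upper()
        return answer.startswith("YES")

    # Only pass the image on to the user if the judge agrees the edit happened.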
o Claude goes away for 15 minutes, doesn't profile anything, many code changes.
o Announces project now performs much better, saving 70% CPU.
- Claude, test the performance.
o Performance is 1% _slower_ than previous.
- Claude, can I have a refund for the $15 you just wasted?
o [Claude waffles], "no".
This is the loop (and honestly, I predicted it way before it started):
1) LLMs can generate code from "natural language" prompts!
2) Oh wait, I actually need to improve my prompt to get LLMs to follow my instructions...
3) Oh wait, no matter how good my prompt is, I need an agent (aka a for loop) that goes through a list of deterministic steps so that it actually follows my instructions...
4) Oh wait, now I need to add deterministic checks (aka, the code that I was actually trying to avoid writing in step 1) so that the LLM follows my instructions...
5) <some time in the future>: I came up with this precise set of keywords that I can feed to the LLM so that it produces the code that I need. Wait a second... I just turned the LLM into a compiler.
The error is believing that "coding" is just accidental complexity. "You don't need a precise specification of the behavior of the computer" - this is the assumption that would have to hold for LLM agents to actually be viable. And I cannot believe that there are software engineers who think coding is accidental complexity. I understand why PMs, CEOs, and other fun people believe this.
Side note: I am not arguing that LLMs/coding agents aren't nice. T9 was nice, autocomplete is nice. LLMs are very nice! But I am getting a bit too fed up with seeing everyone believe that you can get rid of coding.
Models definitely need less and less of this with each version that comes out, but it's still what you need to do today if you want to be able to trust the output. And even in a future where models approach perfection, I think this approach will be the way to reduce latency and keep tabs on whether your prompts are producing the output you expected on a larger scale. You will also be building good evaluation data for testing alternative approaches, or even fine-tuning.
jqpabc123•3d ago
Thanks for pointing out the elephant in the room with LLMs.
The basic design is non-deterministic. Trying to extract "facts" or "truth" or "accuracy" is an exercise in futility.
steerlabs•3d ago
My thesis isn't that we can stop the hallucinating (non-determinism), but that we can bound it.
If we wrap the generation in hard assertions (e.g., assert response.price > 0), we turn 'probability' into 'manageable software engineering.' The generation remains probabilistic, but the acceptance criterion becomes binary and deterministic.
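In plain Python that might look something like the sketch below; the function and field names are stand-ins, not Steer's actual API:

    import json

    def generate_quote(prompt: str) -> str:
        """Stand-in for whatever LLM call you use; returns the model's raw text."""
        return '{"price": 12.5, "currency": "USD"}'  # hypothetical model output

    def accept(raw: str) -> dict:
        response = json.loads(raw)    # must parse as JSON at all
        assert response["price"] > 0  # the hard assertion from above
        return response

    for _ in range(3):  # bound the retries, and with them the non-determinism
        try:
            result = accept(generate_quote("Quote me a price for part #42"))
            break
        except (AssertionError, KeyError, ValueError):
            continue
    else:
        raise RuntimeError("model never produced an acceptable response")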
jqpabc123•2d ago
Unfortunately, the use-case for AI is often where the acceptance criteria is not easily defined --- a matter of judgment. For example, "Does this patient have cancer?".
In cases where the criteria can be easily and clearly stipulated, AI often isn't really required.
steerlabs•15h ago
My thesis is that even in those "fuzzy" workflows, the agent's process is full of small, deterministic sub-tasks that can and should be verified.
For example, before the AI even attempts to analyze the X-ray for cancer, it must: 1/ Verify it has the correct patient file (PatientIDVerifier). 2/ Verify the image is a chest X-ray and not a brain MRI (ModalityVerifier). 3/ Verify the date of the scan is within the relevant timeframe (DateVerifier).
These are "boring," deterministic checks. But a failure on any one of them makes the final "judgment" output completely useless.
steer isn't designed to automate the final, high-stakes judgment. It's designed to automate the pre-flight checklist, ensuring the agent has the correct, factually grounded information before it even begins the complex reasoning task. It's about reducing the "unforced errors" so the human expert can focus only on the truly hard part.
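As a rough illustration of that pre-flight checklist (the verifier names are from the comment above; the code is a sketch, not Steer's API):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ScanRequest:
        patient_id: str
        expected_patient_id: str
        modality: str
        scan_year: int

    def patient_id_verifier(req: ScanRequest) -> bool:
        return req.patient_id == req.expected_patient_id

    def modality_verifier(req: ScanRequest) -> bool:
        return req.modality == "chest_xray"

    def date_verifier(req: ScanRequest) -> bool:
        return req.scan_year >= 2023  # hypothetical "relevant timeframe"

    PREFLIGHT: list[tuple[str, Callable[[ScanRequest], bool]]] = [
        ("PatientIDVerifier", patient_id_verifier),
        ("ModalityVerifier", modality_verifier),
        ("DateVerifier", date_verifier),
    ]

    def run_preflight(req: ScanRequest) -> None:
        for name, check in PREFLIGHT:
            if not check(req):
                # Fail fast: the expensive judgment step never runs on bad inputs.
                raise ValueError(f"Pre-flight check failed: {name}")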
scotty79•49m ago
Which is kind of crazy because we don't even treat people as databases. Or at least we shouldn't.
Maybe it's one of those things that will disappear from culture one funeral at a time.
philipallstar•32m ago
The overwhelming majority of what?
fzeindl•25m ago
"Willison’s insight was that this isn’t just a filtering problem; it’s architectural. There is no privilege separation, and there is no separation between the data and control paths. The very mechanism that makes modern AI powerful - treating all inputs uniformly - is what makes it vulnerable. The security challenges we face today are structural consequences of using AI for everything."
- https://www.schneier.com/crypto-gram/archives/2025/1115.html...
HarHarVeryFunny•22m ago
You can't blame an LLM for getting the facts wrong, or hallucinating, when by design they don't even attempt to store facts in the first place. All they store are language statistics, boiling down to "with preceding context X, most statistically likely next words are A, B or C". The LLM wasn't designed to know or care that outputting "B" would represent a lie or hallucination, just that it's a statistically plausible potential next word.
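A toy illustration of what "language statistics" means here, using nothing but a word-frequency table:

    from collections import Counter, defaultdict

    corpus = "the sky is blue the sky is green the sky is blue".split()

    # Count which word tends to follow which: frequencies, not facts.
    next_word = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        next_word[prev][cur] += 1

    # The "model" happily ranks both continuations of "is"; it has no notion
    # that one of them might be a lie.
    print(next_word["is"].most_common())  # [('blue', 2), ('green', 1)]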
DoctorOetker•21m ago
When numeric models are fit to, say, scientific measurements, they do quite a good job of modeling the probability distribution. With a corpus of text we are not modeling truths but claims. The corpus contains contradicting claims. Humans have conflicting interests.
Source-aware training (which can't be done as an afterthought LoRA tweak, but needs to be done during base model training, AKA pretraining) could enable LLMs to express which answers apply according to which sources. It could provide a review of competing interpretations and opinions, and source every belief, instead of having to rely on tool use / search engines.
None of the base model providers would do it at scale, since it would reveal the corpus and result in attribution.
In theory, entities like the European Union could mandate that LLMs used for processing government data, or sensitive citizen / corporate data, MUST be trained source-aware, which would improve the situation, also making the decisions and reasoning more traceable. This would also ease the discussions and arguments about copyright issues, since it is clear LLMs COULD BE MADE TO ATTRIBUTE THEIR SOURCES.
I also think it would be undesirable to eliminate speculative output; it should just be marked explicitly:
"ACCORDING to <source(s) A(,B,C,..)> this can be explained by ...., ACCORDING to <other school of thought source(s) D,(E,F,...)> it is better explained by ...., however I SUSPECT that ...., since ...."
If it could explicitly separate the schools of thought sourced from the corpus, and also separate its own interpretations and mark them as LLM-speculated suspicions, then we could still have the traceable references without losing the potential novel insights LLMs may offer.
DoctorOetker•12m ago
https://arxiv.org/abs/2404.01019
"Source-Aware Training Enables Knowledge Attribution in Language Models"
sweezyjeezy•15m ago
I don't think using deterministic / stochastic as the dividing property is useful here if we're talking about a tool to mimic humans. Describing a human coder as 'deterministic' doesn't seem right - if you give one the same tasks under different environmental conditions, I don't think you get exactly the same outputs either. I think what we're really talking about is some sort of fundamental 'instability' of LLMs, à la chaos theory.
pydry•14m ago