LLMs bring new nature of abstraction – up and sideways

https://martinfowler.com/articles/2025-nature-abstraction.html

54•tudorizer•7mo ago

Comments

oytis•7mo ago

I don't get his argument, and if it wasn't Martin Fowler I would just dismiss it. He admits himself that it's not an abstraction over previous activity as it was with HLLs, but rather a new activity altogether - that is prompting LLMs for non-deterministic outputs.

Even if we assume there is value in it, why should it replace (even if in part) the previous activity of reliably making computers do exactly what we want?

dist-epoch•7mo ago

Because unreliably solving a harder problem with LLMs is much more valuable than reliably solving an easier problem without.

darkwater•7mo ago

Which harder problems are LLMs going to (unreliably) solve in your opinion?

dist-epoch•7mo ago

Anything which requires "common sense".

A contrived example: there are only 100 MB of disk space left, but 1 GB of logs to write. LLM discards 900 MB of logs and keeps only the most important lines.

Sure, you can nitpick this example, but it's the kind of edge case handling that LLMs can "do something resonable" that before required hard coding and special casing.

sarchertech•7mo ago

In that example something simple like log the errors, or log the first error of the same type per 5 minute block had some percent chance of solving 100% of the problem.

And it’s not just this specific problem. I don’t think letting an LLM handle edge cases is really ever an appropriate use case in production.

I’d much rather the system just fail so that someone will fix it. Imagine a world where at every level instead of failing and halting, everything error just got bubbled up to an LLM that tried to do something reasonable.

Talk about emergent behavior, or more likely catastrophic cascading failures.

I can kind of see your point if you’re talking about a truly hopeless scenario. Like some imaginary autonomous spacecraft that is going to crash into the sun, so in a last ditch effort the autopilot turns over the controls to an LLM.

But even in that scenario we have to have some way of knowing that we truly are in a hopeless scenario. Maybe it just appears that way and the LLM makes it worse.

Or maybe the LLM decides to pilot it into another spacecraft to reduce velocity.

My point is there aren’t many scenarios where “do something reasonable 90% of the time, but do something insane the other 10% of the time” is better than do nothing.

I’ve been using LLMs at work and my gut feeling saying I’m getting some productivity boost, but I’m not even certain of that because I have also spent time chasing subtle bugs that I wouldn’t have introduced myself. I think I’m going to need to see the results of some large well designed studies and several years of output before I really feel confident saying one way or the other.

oytis•7mo ago

OK, so we are having two classes of problems here - ones worth solving unreliably, and ones that are better solved without LLMs. Doesn't sound like a next level of abstraction to me

dist-epoch•7mo ago

I was thinking more along this line: you can solve unreliably 100% of the problem with LLMs, or solve reliably only 80% of the problem.

So you trade reliability to get to that extra 20% of hard cases.

pydry•7mo ago

The story of programming is not largely one of humans striving to be more reliable when programming but putting up better defenses against our own inherent unreliabilities.

When I watch juniors struggle they seem to think that it's because they dont think hard enough whereas it's usually because they didnt build enough infrastructure that would prevent them from needing to think too hard.

As it happens, when it comes to programming, LLM unreliabilities seem to align quite closely with ours so the same guardrails that protect against human programmers' tendencies to fuck up (mostly tests and types) work pretty well for LLMs too.

furyofantares•7mo ago

I'm pretty deep into these things and have never had them solve a harder problem than I can solve. They just solve problems I can solve much, much faster.

Maybe that does add up to solving harder higher level real world problems (business problems) from a practical standpoint, perhaps that's what you mean rather than technical problems.

Or maybe you're referring to producing software which utilizes LLMs, rather than using LLMs to program software (which is what I think the blog post is about, but we should certainly discuss both.)

dist-epoch•7mo ago

> solve a harder problem than I can solve

If you've never done web-dev, and want to create an web-app, where does that fall? In principle you could learn web-dev in 1 week/month, so technically you could do it.

> maybe you're referring to producing software which utilizes LLMs

but yes, this is what I meant, outsourcing "business logic" to an LLM instead of trying to express it in code.

kookamamie•7mo ago

Funny, I dismiss the opinion based on the author in question.

Insanity•7mo ago

Serious question - why? I know of the author but don’t see a reason to value his opinion on this topic more or less because of this.

(Attaching too much value to the person instead of the argument is more of an ‘argument from authority’)

kookamamie•7mo ago

Let's just say I think a lot of damage was caused by their OOP evangelism back in the day.

diggan•7mo ago

You don't think the damage was done by the people who religiously follow whatever loudmouths says? Those are the people I'd stop listening to, rather than ignoring what an educator says when sharing their perspective.

Don't get me wrong, I feel like Fowler is wrong about some things too, and wouldn't follow what he says as dogma, but I don't think I'd attribute companies going after the latest fad as his fault.

kookamamie•7mo ago

Perhaps. Then again, advocating things like Singleton as anything beyond a gloriefied global variable is pretty high on my BS list.

An example: https://martinfowler.com/bliki/StaticSubstitution.html

diggan•7mo ago

> gloriefied global variable is pretty high on my BS list

Say you have a test that is asserting the output of some code, and that code is using a global variable of some kind, how do you ensure you can have tests that are using different values for that global variable and it all works? You'd need to be able to change it during tests somehow.

Personally, I think a lot of the annoying parts of programming go away when you use a more expressive language (like Clojure), including this one. But for other languages, you might need to work around the limitations of the language and then approaches like using Singletons might make more sense.

At the same time, Fowlers perspective is pretty much always in the context of "I have this piece of already written code I need to make slightly better", obviously the easy way is to not have global variables in the first place, but when working with legacy code you do stumble upon one or three non-optimal conditions.

DonHopkins•7mo ago

So all global and static variables without exception are bad, therefore calling them singletons and documenting when and how to use and not use them is bad, huh?

Do you really believe that nobody would be using global variables if it weren't for Martin Fowler? Do you never use them?

alganet•7mo ago

You need to understand that Mr. Fowler works for a consultancy.

LLMs sound great for consultants. A messy hyped technology that you can charge to pretend to fix? Jackpot.

All things these consultancies eventually promote are learnings they had with their own clients.

The OOP patterns he described in the past likely came from observing real developers while being in this consultant role, and _trying_ to document how they overcame typical problems of the time.

I have a feeling that the real people with skin on the game (not consultants) that came up with that stuff would describe it in much simpler terms.

Similarly, it is likely that some of these posts are based on real experience but "consultancified" (made vague and more complex than it needs to be).

dcminter•7mo ago

I'm a bit too lazy to check, but didn't he leave thoughtworks?

Apropos of nothing I saw him speak once at a corporate shindig and I didn't get the impression that he enjoyed it very much. Some of the engineering management were being super weird about him being a (very niche) famous person too...

alganet•7mo ago

https://martinfowler.com/aboutMe.html

> [...] I work for Thoughtworks [...]

> [...] I don't come up with original ideas, but do a pretty good job of recognizing and packaging the ideas of others [...]

> [...] I see my main role as helping my colleagues to capture and promulgate what we've learned about software development to help our profession improve. We've always believed that this openness helps us find clients, recruit the best people, and help our clients succeed. [...]

So, we should read him as such. It's a consultant, trying to capture what successful teams do. Sometimes succeeding, sometimes failing.

dcminter•7mo ago

Yeah, seems I was misremembering - looks like he just doesn't do talks any more.

Disposal8433•7mo ago

His Refactoring book was a good thing at the time. But it ends there, he should have tried to program instead of writing all the other books that made no sense.

felineflock•7mo ago

It is a new nature of abstraction, not a new level.

UP: It lets us state intent in plain language, specs, or examples. We can ask the model to invent code, tests, docs, diagrams—tasks that previously needed human translation from intention to syntax.

BUT SIDEWAYS: Generation is a probability distribution over tokens. Outputs vary with sampling temperature, seed, context length, and even with identical prompts.

dcminter•7mo ago

Surely given an identical prompt with a clean context and the same seed the outputs will not vary?

diggan•7mo ago

+ temperature=0.0 would be needed for reproducible outputs. And even with that, if it's actually reproducible or not depends on the model/weights themselves, not all of them are even when all those things are static. And then finally depends on the implementation of the model architecture as well.

I think the tricky part is that we tend to think that prompts with similar semantic meaning will give the same outputs (like a human), while LLMs can give vastly different outputs if you have one spelling mistake for example, or used "!" instead of "?", the effect varies greatly per model.

dcminter•7mo ago

Hmm, I'm barely even a dabbler, but I'd assumed that the seed in question drove the (pseudo)randomness inherent in "temperature" - if not, what seed(s) do they use and why could one not set that/those too?

To your second part I wouldn't make that assumption - I can see how a non-technical person might, but surely programmers wouldn't? I've certainly produced very different output from that which I intended in boring old C with a mis-placed semi-colon after all!

diggan•7mo ago

> Hmm, I'm barely even a dabbler, but I'd assumed that the seed in question drove the (pseudo)randomness inherent in "temperature" - if not, what seed(s) do they use and why could one not set that/those too?

Implementations and architectures are different enough that it's hard to say "It's like X" in all cases. Last time I tried to achieve 100% reproducible outputs, which obviously includes hard-coding various seeds, I remember not getting reproducible outputs unless setting temperature to 0, I think this was with Qwen2 or Qwq used via Huggingface's Transformers library, but cannot find the exact details now.

Then in other cases, like the hosted OpenAI models, they straight up say "temperature to 0 makes them mostly deterministic", but I'm not exactly sure why they are unable to offer endpoints with determinism.

> I can see how a non-technical person might, but surely programmers wouldn't?

When talking even with developers about prompting and LLMs, there is still quite a few people who are surprised that "You are a helpful assistant." would lead to different outputs than "You are a helpful assistant!". I think if you're a programmer or not matters less, more about understanding how the LLMs actually work in order to understand that.

dcminter•7mo ago

Oh, well that's super interesting, thanks; I guess some side effect of the high degree of parallelism? Anyway, I guess I need to do a bit more than dabble.

> I think if you're a programmer or not matters less, more about understanding how the LLMs actually work in order to understand that.

Sounds like I need to understand them better then as I merely had different misaprehensions than those. More reading for me...

smokel•7mo ago

> I think the tricky part is that we tend to think that prompts with similar semantic meaning will give the same outputs (like a human)

Trust me, this response would have been totally different if I were in a different mood.

furyofantares•7mo ago

You can make these things deterministic for sure, and so you could also store prompts plus model details instead of code if you really wanted to. Lots of reasons this would be a very very poor choice but you could do it.

I don't think that's how you should think about these things being non-deterministic though.

Let's call that technical determinism, and then introduce a separate concept, practical determinism.

What I'm calling practical determinism is your ability as the author to predict (determine) the results. Two different prompts that mean the same thing to me will give different results, and my ability to reason about the results from changes to my prompt is fuzzy. I can have a rough idea, I can gain skill in this area, but I can't gain anything like the same precision as I have reasoning about the results of code I author.

genidoi•7mo ago

This is too abstract and a concrete example of what this looks like in output is needed.

dtagames•7mo ago

I respect Martin Fowler greatly but those who, by their own admission, have not used current AI coding tools really don't have much to add regarding how they affect our work as developers.

I do hope he takes the time to get good with them!

diggan•7mo ago

> have not used current AI coding tools really don't have much to add regarding how they affect our work as developers

I dunno, sometimes it's helpful to learn about the perspectives of people who've watched something from afar as well, especially if they already have broad knowledge and context that is adjacent to the topic itself, and have lots of people around them deep in the trenches that they've discussed with.

A bit like historians still can provide valuable commentary on wars, even though they (probably) haven't participated in the wars themselves.

TZubiri•7mo ago

I agree, I don't use coding tools, "except to ask for a script to chatgpt every once in a while". But I experience it by reviewing and detecting LLM generated code by consultants and juniors. It's easy to ask them for the prompts for example, but when they use autocompletion based LLMs, it's really hard to distinguish source from target code.

nmaley•7mo ago

I'm in the process of actually building LLM based apps at the moment, and Martin Fowler's comments are on the money. The fact is seemingly insignificant changes to prompts can yield dramatically different outcomes, and the odd new outcomes have all these unpredictable downstream impacts. After working with deterministic systems most of my career it requires a different mindset.

It's also a huge barrier to adoption by mainstream businesses, which are used to working to unambiguous business rules. If it's tricky for us developers it's even more frustrating to end users. Very often they end up just saying, f* it, this is too hard.

I also use LLM's to write code and for that they are a huge productivity boon. Just remember to test! But I'm noticing that use of LLM's in mainstream business applications lags the hype quite a bit. They are touted as panaceas, but like any IT technology they are tricky to implement. People always underestimate the effort necessary to get a real return, even with deterministic apps. With indeterministic apps it's an even bigger problem.

CraigJPerry•7mo ago

Some failure modes can be annoying to test for. For example, if you exceed the model’s context window, nothing will happen in terms of errors or exceptions but the observable performance on the task will tank.

Counting tokens is the only reliable defence i found to this.

danielbln•7mo ago

If you exceed the context window the remote LLM endpoint will throw you an error which you probably want to catch, or rather you want to catch that before it happens and deal with it. Either way, it's not a silent error that goes unnoticed usually, what makes you think that?

CraigJPerry•7mo ago

Interesting, the completion return object is documented but theres no error or exception field. In practice the only errors ive seen so far have been on the HTTP transport layer.

It would make sense to me for the chat context to raise an exception. Maybe i should read the docs further…

diggan•7mo ago

> If you exceed the context window the remote LLM endpoint will throw you an error which you probably want to catch

Not every endpoint works the same way, I'm pretty sure LM Studio's OpenAI-compatible endpoints will silently (from the clients perspective) truncate the context, rather than throw an error. It's up to the client to make sure the context fits in those cases.

OpenAI's own endpoints do show an error and refuses if you exceed the context length though. I think I've seen others use the "finish_reason" attribute too to signal the context length was exceeded, rather than setting an error status code on the response.

Overall, even "OpenAI-compatible" endpoints often aren't 100% faithful reproductions of the OpenAI endpoints, sadly.

danielbln•7mo ago

That seems like terrible API design to just truncate without telling the caller. Anthropic, Google and OpenAI all will fail very loudly if you exceed the context window, and that's how it should be. But fair enough, this shouldn't happen anyway and the context should be actively handled before it blows up either way.

dist-epoch•7mo ago

It's complicated, for example some models (o3) will throw an error if you set temperature.

What do you do if you want to support multiple models in your LLM gateway? Do you throw an error if a user sets temperature for o3, thus dumping the problem on them? Or just ignore it, but potentially creating confusion because temperature will seem to not work for some models?

danielbln•7mo ago

I'm a big fan of fail early and fail loudly.

DonHopkins•7mo ago

Me to, and I'm always battling with the LLM's obsession with lazily writing reams of ridiculously defensive code and masking errors in the code it generates and calls, instead of finding the root cause and solving that.

(Yes, I'm referring to the code LLMs generate, not the API for generating code itself, but "fail early and spectacularly" should apply to all code and apis.)

But you have to draw the line at failures that happen in the real world, or in code you can't control. I'm a huge fan of Dave Ackley's "Robust First" computing architecture, and his Moveable Feast Machine.

His "Robust First" philosophy is extremely relevant and has a lot of applications to programming with LLMs, not just hardware design.

Robust First | A conversation with Dave Ackley (T2 Tile Project) | Functionally Imperative Podcast

https://www.youtube.com/watch?v=Qvh1-Dmav34

Robust-first computing: Beyond efficiency

https://www.youtube.com/watch?v=7hwO8Q_TyCA

Bottom up engineering for robust-first computing

https://www.youtube.com/watch?v=y1y2BIAOwAY

Living Computation: Robust-first programming in ULAM

https://www.youtube.com/watch?v=I4flQ8XdvJM

https://news.ycombinator.com/item?id=22304063

DonHopkins on Feb 11, 2020 | parent | context | favorite | on: Growing Neural Cellular Automata: A Differentiable...

Also check out the "Moveable Feast Machine", Robust-first Computing, and this Distributed City Generation example:

https://news.ycombinator.com/item?id=21858577

DonHopkins on Oct 26, 2017 | parent | favorite | on: Cryptography with Cellular Automata (1985) [pdf]

A "Moveable Feast Machine" is a "Robust First" asynchronous distributed fault tolerant cellular-automata-like computer architecture. It's similar to a Cellular Automata, but it different in several important ways, for the sake of "Robust First Computing". These differences give some insight into what CA really are, and what their limitations are.

Cellular Automata are synchronous and deterministic, and can only modify the current cell: all cells are evaluated at once (so the evaluation order doesn't matter), so it's necessary to double buffer the "before" and "after" cells, and the rule can only change the value of the current (center) cell. Moveable Feast Machines are like asynchronous non-deterministic cellular automata with large windows that can modify adjacent cells.

Here's a great example with an amazing demo and explanation, and some stuff I posted about it earlier:

https://news.ycombinator.com/item?id=14236973

Robust-first Computing: Distributed City Generation:

https://www.youtube.com/watch?v=XkSXERxucPc

diggan•7mo ago

> That seems like terrible API design to just truncate without telling the caller

Agree, confused me a lot the first time I encountered it.

It would be great if implementations/endpoints could converge, but with OpenAI moving to the Responses API rather than ChatCompletion, yet the rest of the ecosystem seemingly still implementing ChatCompletion with various small differences (like how to do structured outputs), it feels like it's getting further away, not closer...

alganet•7mo ago

> This evolution in non-determinism is unprecedented in the history of our profession.

Not actually true. Fuzzing and mutation testing have been here for a while.

diggan•7mo ago

I think the whole context of the article is "program with non-deterministic tools", while non-deterministic fuzzing and mutation testing is kind of isolated to "coming up with test cases", not something you constantly program side-by-side with, or even integrate into the (business-side) of the software project itself. That's how I've used fuzzing and mutation testing in the past at least, maybe others use it differently.

Otherwise yeah, there are a bunch of non-deterministic technologies, processes and workflows missing, like what Machine Learning folks been doing for decades, which is also software and non-deterministic, but also off-topic from context of the article, as I read it.

alganet•7mo ago

I just have a problem with his use of the word "unprecedent".

This is not the first rodeo of our profession with non-determinism.

TZubiri•7mo ago

Right, in testing, but not in the compiler chain

alganet•7mo ago

I don't understand what you mean. Can you elaborate on your perception of what a "compiler chain" is and the supposed LLM role in it?

TZubiri•7mo ago

A C compiler outputs x86 or ARM or whatever assembly. C is the source, x86 is the target code.

Javascript is source code that might be interpreted or might output html target code (by Dom manipulation)

Typescript compiles to javascript.

Now javascript is both source and target code. If you upload javascript code that was generated by ts to your repo and you leave out your ts, that's bad.

Similarly, an LLM has english (or any natural language) as it's source code and typescript (or whatever programming language) as its target code. You shouldn't upload your target code to your repo, and you shouldn't consider it source code.

It's interesting that the compiler in this case is non deterministic, but it doesn't change the fact that the prompts are source code, the vibecode is target code.

I have a repo that showcases this

https://github.com/TZubiri/keyboard-transpositions-checker

alganet•7mo ago

Read his text more carefully:

> I can't just store my prompts in git and know that I'll get the same behavior each time

He's not on this idea of using english as source code. He explicitly acknowledges that it doesn't work that way (although he's vague in what _actually_ would replace this).

In summary, he's not talking about english as source code.

It _could_ be that someone else figures out how to use english as authoritative source, but that's not what he's talking about.

In that sense, he's talking about using LLMs as the IDE, tooling. It's not that different from using mutation testing (not something I would commit to the repo), and I stand by my original statement that this is not "unprecedent" as it seems.

smokel•7mo ago

Abstractions for high-level programming languages have always gone in multiple directions (or dimensions if you will). Operations in higher level languages abstract over multiple simpler operations in other languages, but they also allow for abstraction over human concepts, by introducing variable names for example. Variable names are irrelevant to a computer, but highly relevant to humans.

Languages are created to support both computers as well as humans. And to most humans, abstractions such as those presented by, say, Hibernate annotations, are as non-deterministic as can be. To the computer it is all the same, but that is increasingly becoming less relevant, given that software is growing and has to be maintained by humans.

So, yes, LLMs are interesting, but not necessarily that much of a game-changer when compared to the mess we are already in.

bgwalter•7mo ago

How many bandwagons has this guy jumped on? Now he says that LLMs will be the new high level programming languages but also that he listens to colleagues and hasn't really tried them yet.

I suppose he is aiming for a new book and speaker fees from the LLM industrial complex.

somewhereoutth•7mo ago

> As we learn to use LLMs in our work, we have to figure out how to live with this non-determinism. This change is dramatic, and rather excites me. I'm sure I'll be sad at some things we'll lose, but there will also things we'll gain that few of us understand yet. This evolution in non-determinism is unprecedented in the history of our profession.

The whole point of computers is that they were deterministic, such that any effective method can be automated - leaving humans to do the non-deterministic (and hopefully more fun) stuff.

Why do we want to break this up-to-now hugely successful symbiosis?

stpedgwdgfhgdd•7mo ago

I’m programming software development workflows for Claude in plain english (custom commands). The non-deterministic is indeed a (tiny!) bit of a problem, but just tell Claude to improve the command so next time it won’t make the same mistake. One time it added an implementation section to the command. Pretty cool

This is the big game changer: we have a programming environment where the program can improve itself. That is something Fortran couldn’t do.

stpedgwdgfhgdd•7mo ago

(bit off topic, but I wished someone told me this a few months ago; for regular programming generation, use TDD and a type safe language like Go. (Fast compile times and excellent testing support). Don’t aim for the magical huge waterfall prompt)

dingnuts•7mo ago

If LLMs are the new compilers, enabling software to be built with natural language, why can't LLMs just generate bytecode directly? Why generate HLL code at all?

Uehreka•7mo ago

Why would the ability to generate source code imply the ability to generate bytecode? Also you wouldn’t want that, humans can’t review bytecode. I think you may be taking the metaphor too literally.

pixl97•7mo ago

I dont think they are... LLMs can learn from anything thats been tokenized. Feed enough decompiled and labeled data with the bytecode and it's likely the machine will be able to dump out an executable. I wouldn't be surprised if an llm could output a valid elf right now other than the tokens may have been stripped in pretraining.

bird0861•7mo ago

https://ai.meta.com/research/publications/meta-large-languag...

skydhash•7mo ago

Because the semantic for each term in a programming language is pretty much a 1:1 relation to a sequential and logic-based ordering of terms in bytecode (which are still code).

> Also you wouldn’t want that, humans can’t review bytecode

The one great thing about automation (and formalism) is that you don't have to continuously review it. You vet it once, then you add another mechanism that monitors for wrong output/behavior. And now, the human is free for something else.

VinLucero•7mo ago

I agree here. English (human language) to Bytecode is the future.

With reverse translation as needed.

thfuran•7mo ago

English is a pretty terrible language for describing the precise behavior of a program.

demirbey05•7mo ago

How will you figure out or solve hallucinated assembly code ?

imiric•7mo ago

The vibe coders would tell you: you don't. You test the program, or ask the LLM to write tests for you, and if there are any issues, you ask it to fix them. And you do that in a loop until there are no more issues.

I imagine that at some point they must wonder what their role is, and why the LLM couldn't do all of that independently.

akavi•7mo ago

Same reason humans use high-level languages: limited context windows.

Both humans and LLMs benefit from non-leaky abstractions—they offload low-level details and free up mental or computational bandwidth for higher-order concerns. When, say, implementing a permissioning system for a web app, I can't simultaneously track memory allocation and how my data model choices aligns with product goals. Abstractions let me ignore the former to "spend" my limited intelligence on the latter; same with LLMs and their context limits.

Yes, more intelligence (at least in part) means being able to handle larger contexts, and maybe superintelligent systems could keep everything "in mind." But even then, abstraction likely remains useful in trading depth for surface area. Chris Sawyer was brilliant enough to write Rollercoaster Tycoon in assembly, but probably wouldn't be able to do the same for Elden Ring.

(Also, at least until LLMs are so transcendentally intelligent they outstrip our ability to understand their actions, HLLs are much more verifiable by humans than assembly is. Admittedly, this might be a time-limited concern)

ptx•7mo ago

> As we learn to use LLMs in our work, we have to figure out how to live with this non-determinism [...] but there will also things we'll gain that few of us understand yet.

No thanks. Let's not give up determinism for vague promises of benefits "few of us understand yet".

aradox66•7mo ago

Determinism isn't always ideal. Determinism may trade off with things like accuracy, performance, etc. There are situations where the tradeoff is well worth it.

pixl97•7mo ago

Yep, there are plenty of things that aren't computable without burning all the entropy in the visible universe, yet if you exchange it with a heuristic you can get a good enough answer in polynomial time.

Weather forecasts are a good example of this.

josefx•7mo ago

Most heuristics are still deterministic.

betenoire•7mo ago

I understand there are probabilities and shortcuts in weather forecasts.... but what part is non-deterministic?

aradox66•7mo ago

Also, at temperature 0 LLMs can behave deterministically! Indeterminism isn't necessarily quite the right word for the kind of abstraction LLMs provide

josefx•7mo ago

That runs into the issue that nobody runs LLMs with a temperature of zero.

bird0861•7mo ago

Not true. Perhaps very few do, but some do in fact run them at 0. I've done it myself. There are many small models that will gladly perform well in QA with temp 0. Of course there are few situations where this is the recommended setup -- we all know RAG takes less than a billion parameters now to do effectively. But nevertheless there are people who do this, and there are plausibly some use cases for it.

bird0861•7mo ago

Quite pleased you mentioned this. I would like to add transformer LLMs can be turing complete, see the work of Franz Nowak and his colleagues (I think there were at least one or two other papers by other teams but I read Nowak's the closest as it was the latest one when I became aware of this).

gpm•7mo ago

Even at temperature != 0 it's trivial to just use a fixed seed in the RNG... it's just a computer being used in a naive, not even multi threaded (i.e. with race conditions), way.

I wouldn't be surprised to find out different stacks multiple fp16s slightly differently or something. Getting determinism across machines might take some work... but there's really nothing magic going on here.

billyp-rva•7mo ago

Nobody was stopping anyone from making compilers that introduced random different behavior every time you ran them. I think it's telling this didn't catch on.

gpm•7mo ago

I think there was actually a very big push to stop people from doing that - https://en.wikipedia.org/wiki/Reproducible_builds

There were definitely compilers that used things like data-structures with an unstable iteration order resulting in non-determinism, and people went stopping other people from doing that. This behavior would result in non-deterministic performance everywhere, and combined with race conditions or just undefined behavior other random non-deterministic behaviors too.

At least in part this was achieved with techniques that can be used to make LLMs to, like by seeding RNGs in hash tables deterministically. LLMs are in that sense no less deterministic than iterating over a hash table (they are just a bunch of matrix multiplications with a sampling procedure at the end, after all).

danenania•7mo ago

I think this gets at a major hurdle that needs to be overcome for truly human-level AGI.

Because the human brain is also non-deterministic. If you ask a software engineer the same question on different days, you can easily get different answers.

So I think what we want from LLMs is not determinism, just as that's not really what you'd want from a human. It's more about convergence. Non-determinism is ok, but it shouldn't be all over the map. If you ask the engineer to talk through the best way to solve some problem on Tuesday, then you ask again on Wednesday, you might expect a marginally different answer considering they've had time to think on it, but you'd also expect quite a lot of consistency. If the second answer went in a completely different direction, and there was no clear explanation for why, you'd probably raise an eyebrow.

Similarly, if there really is a single "right" answer to a question, like something fact-based or where best practices are extremely well established, you want convergence around that single answer every time, to the point that you effectively do have determinism in that narrow scope.

LLMs struggle with this. If you ask an LLM to solve the same problem multiple times in code, you're likely to get wildly different approaches each time. Adding more detail and constraints to the prompt helps, but it's definitely an area where LLMs are still far behind humans.

sgt101•7mo ago

LLMs are deterministic.

If you run an LLM with optimization turned on on a NVIDIA GPU then you can get non-deterministic results.

But, this is a choice.

bwfan123•7mo ago

Can authors of such articles at least cite Dijkstra's "On the foolishness of "natural language programming"." which appeared eons ago ? Which presents an argument against the "english is a programming language" hype.

[1] https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...

awb•7mo ago

Interesting read and thanks for sharing.

Two observations:

1. Natural language appears to be to be the starting point of any endeavor.

> It may be illuminating to try to imagine what would have happened if, right from the start our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that history would, in a sense, have repeated itself, and that computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system. We would need all the intellect in the world to get the interface narrow enough to be usable, and, in view of the history of mankind, it may not be overly pessimistic to guess that to do the job well enough would require again a few thousand years.

LLMs are trying to replicate all of the intellect in the world.

I’m curious if the author would consider that these lofty caveats may be more plausible today than they were when the text was written.

bwfan123•7mo ago

> I’m curious if the author would consider that these lofty caveats may be more plausible today than they were when the text was written.

What is missed by many and highlighted in the article is the following: that there is no way to be "precise" with natural languages. The "operational definition" of precision involves formalism. For example, I could describe to you in english how an algorithm works, and maybe you understand it. But for you to precisely run that algorithm requires some formal definition of a machine model and steps involved to program it.

The machine model for english is undefined ! and this could be considered a feature and not a bug. ie, It allows a rich world of human meaning to be communicated. Whereas, formalism limits what can be done and communicated in that framework.

skydhash•7mo ago

I forgot where I read it, but the reason that natural languages works so well for communication is because the terms are labels for categories instead of identifiers. You can concatenate enough to refer to a singleton, but for the person in front, it can be many items or an empty set. Some labels may even be nonexistent in their context

So when we want deterministic process, we invent a set of labels where each is a singleton. Alongside them is a set of rules that specify how to describe their transformation. Then we invented machines that can interpret those instructions. The main advantage was that we know the possible outputs (assuming a good reliability) before we even have to act.

LLMs don't work so well in that regard, as while they have a perfect embedding of textual grammar rules, they don't have a good representation for what those labels refers to. All they have are relations between labels and how likely are they used together. But not what are the sets that those labels refer to and how the items in those sets interact.

akavi•7mo ago

> All they have are relations between labels and how likely are they used together. But not what are the sets that those labels refer to and how the items in those sets interact.

Why would "membership in a set" not show up as a relationship between the items and the set?

In fact, it's not obvious to me that there's any semantic meaning not contained in the relationship between labels.

skydhash•7mo ago

Because the training data does not have it. So we have the label "rock" which intersect with some other label like "hard" and "earth". But the item itself have more attributes that we don't bother assigning label to them. Instead, we just experience them. So the label get linked to some qualia. We can assume that there's a collective intersection of the qualia that the label "rock" refers to.

LLMs don't have access to these hidden attributes (think how to describe "blue" to someone born blind). They may understand that color is a property of object, or that "black" is the color you wear for funerals in some locations. But ask them how to describe the color of a specific object and the output is almost guaranteed to be wrong. Unless they are in a funeral in the above location, so he can predict that most people wear black. But it's a guess, not an informed answer.

bwfan123•7mo ago

This is a nice explanation of language, and the world-model that the language is intended to depict. If I understand you correctly, formalism is a kind of language where the world-model (ie, items and actions in the world depicted by the language) leaves no room for doubt.

skydhash•7mo ago

Pretty much and if we take programming languages, where inputs and RNG are not specified by the formal grammar of that language, the programmer needs to split them between good values and bad values. And ideally, halt the program when it detects the latter as the result would be nonsense.

So a program is a more restrictive version of the programming languages, which itself is a more restrictive version of a computer. But the tools to specify those restrictions are not perfect as speed and intuitiveness would suffer greatly (haskell vs python).

akavi•7mo ago

But for most human endeavors, "operational precision" is a useful implementation detail, not a fundamental requirement.

We want software to be operationally precise because it allows us to build up towers of abstractions without needing to worry about leaks (even the leakiest software abstraction is far more watertight than any physical "abstraction").

But, at the level of the team or organization that's _building_ the software, there's no such operational precision. Individuals communicating with each other drop down to such precision when useful, but at any endeavor larger than 2-3 people, the _vast_ majority of communication occurs in purely natural language. And yet, this still generates useful software.

The phase change of LLMs is that they're computers that finally are "smart" enough to engage at this level. This is fundamentally different from the world Dijkstra was living in.

3abiton•7mo ago

On that note, I wonder if having LLM agents communicating with each others in a human language rather than latent space is a big limitation.

roxolotl•7mo ago

One thing I’d add to all of the other comments is just to reflect on experience. Maybe I’ve mostly worked with people who are incompetent with natural language. But assuming that I’ve mostly worked with average people it’s astonishing how common miscommunication is amongst experts when discussing changes to software. I’ve always found the best way to avoid that is to drop into a more structured language. You see this with most communication tools. They add structure to avoid miscommunication.

moregrist•7mo ago

[flagged]

pyman•7mo ago

Who said: Great minds talk about ideas, average minds talk about events, and small minds talk about people?

hamdouni•7mo ago

A wise human

w10-1•7mo ago

> I've not had the opportunity to do more than dabble with the best Gen-AI tools, but I'm fascinated as I listen to friends and colleagues share their experiences. I'm convinced that this is another fundamental change

So: impressions of impressions is the foundation for a declaration of fundamental change?

What exactly is this abstraction? why nature? why new?

RESULT: unfounded, ill-formed expression

agentultra•7mo ago

Abstraction? Hardly.

What are the new semantics and how are the lower levels precisely implemented?

Spoken language isn’t precise enough for programming.

I’m starting to suspect what people are excited about is the automation.

skydhash•7mo ago

But that's not really automation.

It's more search and act based on the first output. You don't know what's going to come out, you just hope it will be good. The issue is that that the query is fed to the output function. So what you get is a mixture of what is a mixture of what you told it and what's was stored. Great if you can separate the two afterwards, not so if the output is tainted by the query.

With automation, what you seek is predictability. Not an echo chamber.

ADDENDUM

If we continue with the echo chamber analogy:

Prompt Engineering: Altering your voice so that the result back is more pleasant

System Prompt: The echo chamber's builders altering the configuration to get the above effects

RAG: Sound effects

Agent: Replace yourself in front of the echo chamber with someone/something that act based on the echo.

abeppu•7mo ago

I think we should shift the focus from adapting LLMs to our purposes (e.g. external tool use) and adapting how we think about software and focus on getting models that internally understand compilation and execution. Rather than merely building around next token prediction, the industry should take advantage of the fact that software in particular provides a cheap path to learning a domain-specific "world model".

Currently I sometimes get predictions where a variable that doesn't exist gets used or a method call doesn't match the signature. The text of the code might look pretty plausible but it's only relatively late that a tool invocation flags that something is wrong.

If instead of just code text, we trained a model on (code text,IR, bytecode) tuples, (byte code, fuzzer inputs, execution trace) examples, and (trace, natural language description) annotations. The model needs to understand not just what token sequences seem likely but (a) what will the code compile to? (b) what does the code _do_ and (c) how would a human describe this behavior? Bonus points for some path to tie in pre/post conditions, invariants, etc

"People need to adapt to weaker abstractions in the LLM era" is a short term coping strategy. Making models that can reason about abstractions in a much tighter loop and higher fidelity loop may get us code generation we can trust.

yuvadam•7mo ago

These types of errors are not only rare in one-shots, but also very easy to fix in subsequent iterations - e.g. Claude Code with Sonnet rarely makes these errors.

perching_aix•7mo ago

> so I can't just store my prompts in git and know that I'll get the same behavior each time.

Yes you can, albeit it's pretty silly to do so. LLMs are not (inherently) nondeterministic, you're just not pinning the seed, or are using a remote, managed service for them with no reliability and consistency guarantees. [0]

Here, experiment with me [1]:

- download Ollama 0.9.3 (current latest)

- download the model gemma3n e4b (digest: 15cb39fd9394) using the command "ollama run gemma3n:e4b"

- pin the seed to some constant; let's use 42 as an example, by issuing the following command in the ollama interactive session started in the previous step: "/set parameter seed 42"

- prompt the model: "were the original macs really exactly 9 inch?"

It will respond with:

> You're right to question that! The original Macintosh (released in 1984) *was not exactly 9 inches*. It was marketed as having a *9-inch CRT display*, but the actual usable screen size was a bit smaller. (...)

Full response here: https://pastebin.com/PvFc4yH7

The response should be the same over time, across all devices, regardless of whether GPU-acceleration is available.

Bit of an aside, but the overall sentiment echoed in the article reminds me to how visual programming was going to revolutionize everything and take programmers' jobs. With the exception that AI I find actually useful, and was able to integrate it into my workflow.

[0] All of this is to say, to the extent LLM nondeterminism is currently a model trait, it is substituted using a PRNG. Actual nondeterminism is at most an inference engine trait typically instead, see e.g. batched inference.

[1] Details are for experiment reproduction purposes. You can substitute the listed inference engine, model, seed, and prompt with whatever your prefer for your own set of experiments.

0points•7mo ago

It's only deterministic with that exact model and that exact seed. So you are SOL unless you have a copy of that model.

perching_aix•7mo ago

And that exact prompt and that exact inference engine [version].

Pretty reasonable if you ask me. All of this was to say, these are still programs, all the regular sensibilities still apply. Heck, even for that, the sensibilities that apply are pretty old-school: in modern, threaded applications, you'd expect runtime behavioral variations. Not the case here. Even for the high-level language compilers referred to in the article, this doesn't apply so easily. The folks over at reproducible builds [0] put in a decent bit of effort to my knowledge to make it happen.

The overarching point being that it's not magic: it's technology. And if you hold them to even broadly similar standards you hold compilers to, they are absolutely deterministic.

In case you mean that if you pick anything else other than what I picked here, the process ceases to be deterministic, that is not true. You can trivially test and confirm that the same way I did.

[0] https://reproducible-builds.org/

saurik•7mo ago

This is really brittle, though, and in a way that matters a lot more than reproducible builds. Like, sure: a slightly different compiler might cause a slightly different bug to be introduced into the code, but here a slight bit of non-determinism leads to a drastically and fundamentally different result! And not only is the exact inference engine version important, but sometimes the exact hardware as well (when run on GPU)...

perching_aix•7mo ago

My pet peeve here is mistaking nondeterminism with model unreliability. The issue is not that these models or inference engines are inherently nondeterministic, they aren't, but that the former are just plain unreliable.

I can ask the example question multiple times, and often it will tell me absolute balooney. This is not the model being nondeterministic, it's the model being not very good. It's inconsistent across what we consider semantically equivalent prompts. This is a gap in alignment, not some sort of computational property, like nondeterminism. If you ask it 4+4 in multiple different ways, it will always(?) reply correctly, but ask it something more complex, and you can get seriously different results.

The same exchange can be played out multiple times again and again, and it will reply with the exact same misunderstandings in the exact same order again and again, byte to byte matching. The kicker here is batched inference being a necessity for economical reasons at scale, and exactly matching outputs being impractical because of their non-formal target usage. So you get these wishy washy back and forths about whether the models are deterministic or not, which is super frustrating and misses the point.

And I have to keep dodging bullets like "well if you swap the hardware from under it it might change" or "if you prompt a remote service where they control what's deployed and don't tell you fully, and batch your prompt together with a bunch of other people's which you have no control over, it will not be reproducible". Yes, it might not be, obviously. Even completely deterministic code can become nondeterministic if you run it on top of an execution environment that makes it so. That's not exactly a reasonable counter I don't think.

With regular software, there's a contract (the specifications) that code is developed against, and variances are constrained within there. This is not the case with LLMs, the variances are letting jesus take over the wheel tier, and that's a perfectly fine retort. But then it's not nondeterminism that's the issue.

0points•7mo ago

> Pretty reasonable if you ask me.

Where can i download ChatGPT models?

> The folks over at reproducible builds

I like those folks (hello hboeck!), but their work are unrelated to determinstic LLM output, so why even bring them up here?

> The overarching point being that it's not magic: it's technology.

Yea are you even responding to me, or is this just a stream of thought?

Noone said LLM is magic.

perching_aix•7mo ago

> Where can i download ChatGPT models?

Nowhere. Does that make LLMs nondeterministic?

> I like those folks (hello hboeck!), but their work are unrelated to determinstic LLM output, so why even bring them up here?

Did you read the blogpost in the OP?

I cannot understand what I wrote for you on your behalf. What could possibly be unclear about this? Or was this another rhetorical like the previous one, to polish your snark?

> Yea are you even responding to me, or is this just a stream of thought?

Yes, I was. Are you able to ask non-rhetorical questions too? Should I have asked for your blessed permission to write more than just a direct address of what you asked about?

0points•7mo ago

>> Where can i download ChatGPT models?

> Nowhere. Does that make LLMs nondeterministic?

It does, yes.

I can not reproduce the same output without access to the model.

Hence, not deterministic.

perching_aix•7mo ago

I see, that's a very entertaining way of reasoning about the determinism of an entire class of software, thanks for sharing.

lufenialif2•7mo ago

I tried recreating this - gemma3n:e4b ID 15cb39fd9394 and /set parameter seed 42 and I got 2 different results on the two times I asked 'were the original macs really exactly 9 inch?'

Strangely, I'm only getting 2 alternating results every time I restart the model. I was not able to get the same result as you and certainly not with links to external sources. Is there anything else I could do to try to replicate your result?

I've only used ChatGPT prior and it'd be nice to use locally run models with consistent results.

perching_aix•7mo ago

Hmm, that's pretty strange. Not sure what might be going wrong, could be terminal shenanigans or a genuine bug in ollama.

Of course, double checking the basics would be a good thing to cover: ollama --version should return ollama 0.9.3, and the prompt should be copied and pasted to ensure it's byte-exactly matching.

Maybe you could also try querying the model through its API (localhost:11434/api/generate)? I'll ask a colleague to try and repro on his Mac like last time just to double check. I also tried restarting the model a few times, worked as expected here.

*Update:* getting some serious reproducibility issues across different hardware. A month ago the same experiment with regular quantized gemma3 worked fine between GPU, CPU, and my colleague's Mac, this time the responses differ everywhere (although they are consistent between resets on the same hw). Seems like this model may be more sensitive to hardware differences? I can try generating you a response with regular gemma3 12b qat if you're interested in comparing that.

lufenialif2•7mo ago

Yeah trying out gemma3 12b qat sounds great.

I got back 0.9.3 as well as copied and pasted the prompt (included quotes and no quotes as well just in case...)

I can try the API as well and I'm using a legion 15ach6 but I could also try on my MacBook Pro.

perching_aix•7mo ago

Okay, this has been a ride.

Reverted to 0.8.0 of ollama, switched to gemma3:12b-it-qat for the model, set the seed to 42 and the temp to 0, and used my old prompt. This way I was able to get consistent results everywhere, and could confirm from old screenshots everything still matches.

Prompt and output here: https://pastebin.com/xUi3bbGh

However, when using the prompt I used previously in this thread, I'm getting a different response between machines, even with the temp and seed pinned. On the same machine, I initially found that it's reliably the same, but after running it a good few times more, I was eventually able to get the flip-flopping behavior you describe.

API wise, I just straight up wasn't able to get consistent results at all, so that was a complete bust.

Ultimately, it seems like I successfully fooled myself in the past and accidentally cherry picked an example? Or at least it's way more brittle than I thought. At this point I'd need significantly more insight into how the inference engine (ollama) works to be able to definitively ascertain whether this is a model or an engine trait, and whether it is essential for the model to work (although I'm still convinced it isn't). Not sure if that helps you much in practice though.

I wouldn't make a good scientist, apparently :)

lufenialif2•7mo ago

I appreciate the effort! And I disagree - that's what it's all about haha

I assume there are more levers we could try pulling to reduce variation? I'll be looking into this as well.

As an aside, because of my own experience with variability using chatGPT (non-API, I assume there are also more levers to pull here), I've been thinking about LLMs and their application to gaming. To what extent it is possible to use LLMs to interpret a result and then return a variable that then executes the usual state updates? This would hopefully add a bit of intentional variability in the game's response to user inputs but consistency in updating internal game logic.

edit: found this! https://github.com/rasbt/LLMs-from-scratch/issues/249 Seems that it's an ongoing issue from various other links I've found, and now when I google "ollama reproducibility" this thread comes up on the first page, so it seems it's an uncommon issue as well :(

gdubs•7mo ago

The Star Trek computer – I grew up with TNG – is the example I always think of. They ask it in human language to pull up some data, create a visualization, run an analysis. You get the sense that people also still write programs in the future in a more manual way – but, done well these LLMs are the building blocks for that more conversational way of getting the computer to do what you want it to do.

A lot of the complaints that come up on Hacker News are around the idea that a piece of code needs to be elegantly crafted "Just so" for a particular purpose. An efficient algorithm, a perfectly correct program. (Which, sorry but – have you seen most of the software in the world?)

And that's all well and good – I like the craft too. I'm proud of some very elegant code I've written.

But, the writing is on the wall – this is another turning point in computing similar to the personal computer. People scoffed at that too. "Why would regular people want a computer? Their programs will be awful!"

UltraSane•7mo ago

a TLA+ model combined with a good LLM to generate code from it is the ultimate level of useful abstraction.

mrheosuper•7mo ago

i have a "server" at home that host proxmox. On it there is 1 VM i used to test non-so-legal software, with no connection to internet or local network of course (hopefully proxmox virtual switch is good).

I could say it's a lab, right ?

rkhalaf•7mo ago

Now that machines can also follow natural language instructions, then programs can become a mix of code and natural language - using each where it makes the most sense.

We looked into how we could blend this by integrating natural language within a programming system here https://blog.ballerina.io/posts/2025-04-26-introducing-natur...

and @bwfan123, we did cite Dijkstra as well as Knuth. I believe sometimes you need the rigor of symbolic notation and other times you need the ambiguity of language - it depends on what you are doing and we can let the user decide how and when to blend the two.

It's not about one or the other, but a gentle and harmonious blending

Do you have a mathematically attractive face?

Code only says what it does

The success of 'natural language programming'

The Scriptovision Super Micro Script video titler is almost a home computer

Discovering the "original" iPhone from 1995 [video]

Psychometric Comparability of LLM-Based Digital Twins

SidePop – track revenue, costs, and overall business health in one place

The Other Markov's Inequality

The Cascading Effects of Repackaged APIs [pdf]

Lightweight and extensible compatibility layer between dataframe libraries

Haskell for all: Beyond agentic coding

Dorsey's Block cutting up to 10% of staff

Show HN: Freenet Lives – Real-Time Decentralized Apps at Scale [video]

In the AI age, 'slow and steady' doesn't win

Administration won't let student deported to Honduras return

How were the NIST ECDSA curve parameters generated? (2023)

AI, networks and Mechanical Turks (2025)

Goto Considered Awesome [video]

Show HN: I Built a Free AI LinkedIn Carousel Generator

Implementing Auto Tiling with Just 5 Tiles

Open Challange (Get all Universities involved

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

Show HN: Isolating AI-generated code from human code | Vibe as a Code

Show HN: More beautiful and usable Hacker News

Toledo Derailment Rescue [video]

War Department Cuts Ties with Harvard University

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

A Bid-Based NFT Advertising Grid

AI readability score for your documentation

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

Do you have a mathematically attractive face?

Code only says what it does

The success of 'natural language programming'

The Scriptovision Super Micro Script video titler is almost a home computer

Discovering the "original" iPhone from 1995 [video]

Psychometric Comparability of LLM-Based Digital Twins

SidePop – track revenue, costs, and overall business health in one place

The Other Markov's Inequality

The Cascading Effects of Repackaged APIs [pdf]

Lightweight and extensible compatibility layer between dataframe libraries

Haskell for all: Beyond agentic coding

Dorsey's Block cutting up to 10% of staff

Show HN: Freenet Lives – Real-Time Decentralized Apps at Scale [video]

In the AI age, 'slow and steady' doesn't win

Administration won't let student deported to Honduras return

How were the NIST ECDSA curve parameters generated? (2023)

AI, networks and Mechanical Turks (2025)

Goto Considered Awesome [video]

Show HN: I Built a Free AI LinkedIn Carousel Generator

Implementing Auto Tiling with Just 5 Tiles

Open Challange (Get all Universities involved

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

Show HN: Isolating AI-generated code from human code | Vibe as a Code

Show HN: More beautiful and usable Hacker News

Toledo Derailment Rescue [video]

War Department Cuts Ties with Harvard University

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

A Bid-Based NFT Advertising Grid

AI readability score for your documentation

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

LLMs bring new nature of abstraction – up and sideways

Comments