Why AI Agents Cannot Change Software Systems

https://phroneses.com/articles/build/notes/agents-cannot-maintain-systems.html

41•jhevans•59m ago

Comments

taintlord223•46m ago

I would simplify to: Why agents cannot meaningfully contribute

fzeindl•42m ago

I was originally sceptical of LLMs and am far from the „agents will magically fix our future“-crowd, but sentences like these trip me up:

> „But pattern‑matching is not system understanding, and plausibility is not correctness.“

Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

> „LLMs predict tokens, not consequences“

Same here. LLMs output tokens but who says that they don’t form some internal group of token-predicting tensors that move together and constitute the internal model of a „consequence“? It is like saying humans don’t have thoughts, they just have electrical impulses moving their tongues.

I too think that LLMs seem to be a very specific form of intelligence, maybe resembling the parts of our brain that do language-processing, but it is a fact that they at least fake intelligence very convincingly. And that we actually don’t know how they do it.

locknitpicker•33m ago

> Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

Yes indeed. That's a perplexing statement considering that a central concept of software engineering is architecture patterns.

thesz•18m ago

  > central concept of software engineering is architecture patterns.

Both RUP and PSP/TSP do stand on the ground of defect prevention. All sorts of defects, from incorrect sets of requirements to memory corruption.

Architecture patterns can be of help in that regard and they also can be very error-prone, as right now I am in the process of removing a bug introduced through misunderstanding of one rather old singleton.

pjm331•27m ago

> > "But pattern‑matching is not system understanding, and plausibility is not correctness."

> Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

I'm not in the camp of "system understanding is just more complex pattern matching"

but I am absolutely in the camp of "there are many tasks where pattern matching is just as effective as actual understanding"

fzeindl•23m ago

> but I am absolutely in the camp of "there are many tasks where pattern matching is just as effective as actual understanding"

What if „being effective at something with pattern matching but not understanding it“ just means that you have identified only 90% of patterns and keep failing to learn the rest for whatever reason.

amelius•24m ago

> Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

I think the naysayers already decided that the burden of proof is on the other side.

pjc50•15m ago

That is the traditional "null hypothesis", yes.

justincormack•17m ago

The whole post is written by AI anyway, so it's not worth engaging with.

jvanderbot•41m ago

TFA falls into a few traps, like a reductio argument about text prediction. There's no reason text prediction can't do these things, fundamentally.

But I pretty much agree with what they are saying. The missing "thing" is the developer context. Each agent I kick off needs a nonlinearly increasing amount of coaching, as a function of feature complexity. The sweet spot for productivity is currently the first 3 steps (from TFA), to get things into _my head_, then using the writing abilities more as ubersed or ubergrep with LSP integration. Love it for that.

For example, I'll often write the first fifth to a third of a feature by hand, then ask the agent to extrapolate from there. The "Core" contains the important bits, but in a large system there are a lot of corner cases and wiring, and agents are good at discovering those. I interrupt when it tries to fix things by departing from the design, and instead nudge it or quickly write a better solution myself.

I absolutely hate the "Spin a cadre of agents to design/implement a feature from a concise spec" workflow. It involves so much planning to get the automatic execution working that it's often just easier to switch to hybrid planning/execution with both AI and people.

cautiouscat•31m ago

I’ve also been finding that “Spin a cadre of agents to design/implement a feature from a concise spec” is really difficult. It’s been faster (for me) to do what you said and do a hybrid.

I’ve been trying out this cadre of agents idea with PR stacking and while I think it’s going to end up working fine, it took so much massaging to get it to where I needed it to be. Whereas with the hybrid approach, the problem space is a lot narrower and easier for me to define and the LLM to implement.

lubujackson•34m ago

The thing is AI can maintain systems. The key point is that it can't do this without human intent, but human intent can be encoded into skills and tied together with orchestration.

Rough example: have an LLM generate a plan. Have a skill that refines the plan considering security risks, another that ensures codebase structures are followed, another that considers the infrastructure and usage demands, etc. Then write code and tests. Another process to validate the tests, validate all the above, simplify the logic, etc.
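A minimal sketch of that shape, assuming each "skill" is just a function that takes a plan and returns a refined plan. The names (`Plan`, `security_review`, `structure_check`, `run_pipeline`) are hypothetical, and the skills are hand-written stubs standing in for LLM calls:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Plan:
    steps: List[str]
    notes: List[str] = field(default_factory=list)


# A skill is any function that refines a plan (hypothetical interface).
Skill = Callable[[Plan], Plan]


def security_review(plan: Plan) -> Plan:
    # Stub skill: flag any step that mentions a secret.
    flagged = [s for s in plan.steps if "secret" in s.lower()]
    notes = plan.notes + [f"security: review '{s}'" for s in flagged]
    return Plan(list(plan.steps), notes)


def structure_check(plan: Plan) -> Plan:
    # Stub skill: enforce the codebase rule "every plan includes tests".
    if any("test" in s.lower() for s in plan.steps):
        return plan
    return Plan(plan.steps + ["add tests"],
                plan.notes + ["structure: added missing test step"])


def run_pipeline(plan: Plan, skills: List[Skill]) -> Plan:
    # Apply each skill in order; each one sees the previous one's output.
    for skill in skills:
        plan = skill(plan)
    return plan


draft = Plan(["read API secret from env", "call billing endpoint"])
final = run_pipeline(draft, [security_review, structure_check])
print(final.steps)
print(final.notes)
```

The ordering of the skills is itself part of the encoded intent: swapping or dropping a stage changes what the final plan guarantees.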

The key is that an LLM can do every task capably, even in a complex system. We simply have not built reasonable orchestration of all the human intent behind each filter, and many of them are constantly in flux. It may be that some elements resist encoding because the complexity of encoding is not worth the hassle to maintain.

For better or worse, managing intent, orchestrating narrow agentic tasks and solidifying patterns into deterministic code (i.e. validation/tests) is going to be the focus of engineers going forward.

christkv•31m ago

I find it works if you do it in small parts of the system but systemwide really creates a lot of slop.

cold_harbor•18m ago

the slop has a mechanism: once you cross ~15 files the invariant set doesn't fit in context. locally correct edits, globally broken.

dvh•26m ago

If you spend $x worth of tokens to "produce a PR ready diff", how much of that $x are you willing to give upstream for incorporating your diff and maintaining it in the future? So far AI folks seem to expect it to be $0. That's my only issue so far.

antirez•23m ago

> LLMs generate statistically plausible continuations of text

Jesus, it's fucking 2026. Even LeCun would never say this again.

dijksterhuis•13m ago

    max log Pr([y_0, …, y_m] | [x_0, …, x_n])

aka most likely output sequence of tokens given an input sequence of tokens.

that statement you’re picking out to criticise is a literal translation of the maths behind all sequence to sequence model training. it dates back at least ten years (see chapter 2 https://www.cs.toronto.edu/~graves/preprint.pdf)

there might have been some fancy additions in the last few years, but the general equations of sequence to sequence machine learning won’t have changed much. maximise likelihood of output sequence, given an input sequence.
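a toy illustration of that objective, assuming a made-up next-token table that only looks at the previous token (real models condition on the whole context; `NEXT_TOKEN_PROBS` and its numbers are invented for the example). the chain rule turns the sequence probability into a sum of per-token log probabilities:

```python
import math
from typing import Dict, List

# Hypothetical conditional distribution Pr(next | previous token).
NEXT_TOKEN_PROBS: Dict[str, Dict[str, float]] = {
    "are": {"you": 0.9, "we": 0.1},
    "you": {"?": 0.7, "!": 0.3},
    "we": {"?": 0.5, "!": 0.5},
}


def sequence_log_prob(inputs: List[str], outputs: List[str]) -> float:
    """log Pr(outputs | inputs) = sum over t of log Pr(y_t | context)."""
    context = list(inputs)
    total = 0.0
    for token in outputs:
        # Toy model: only the last token of the context matters.
        total += math.log(NEXT_TOKEN_PROBS[context[-1]][token])
        context.append(token)
    return total


x = ["how", "are"]
candidates = [["you", "?"], ["we", "!"]]

# Pick the candidate output sequence with the highest log-likelihood.
best = max(candidates, key=lambda y: sequence_log_prob(x, y))
print(best)
```

training pushes the parameters so that the observed output gets the highest score; decoding then searches for a high-scoring sequence.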

adamtaylor_13•20m ago

This is a lot of words to confirm what we already know: we have exosuits, not robots.

Use them as capability enhancers, not drones who go do all the things without review.

injidup•17m ago

Why do people keep writing this drivel? Obviously written by an LLM itself. What they are describing, and what doesn't work, is one-shotting a fix. Almost no human can one-shot a fix to a significant working system.

The human / LLM needs to have some form of error correction signal: either a corpus of tests or a proof system that prevents regressions.

If you have a working system with no tests or validation and let a human loose on it then it will break. How is this different?
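A minimal sketch of that error-correction loop, assuming the proposer (human or LLM) is a function that receives the names of the failing tests from the previous round. `repair_loop`, `propose`, and the canned attempts are all hypothetical stand-ins:

```python
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int], int]
Test = Tuple[str, Callable[[Candidate], bool]]


def repair_loop(propose: Callable[[List[str]], Candidate],
                tests: List[Test],
                max_rounds: int = 5) -> Optional[Candidate]:
    """Ask for candidates until every test passes or rounds run out."""
    failures: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(failures)
        # The test suite is the error signal fed into the next round.
        failures = [name for name, check in tests if not check(candidate)]
        if not failures:
            return candidate
    return None


# Canned attempts at an absolute-value function, standing in for a model.
attempts = iter([lambda x: x, lambda x: -x, lambda x: x if x >= 0 else -x])
seen_feedback: List[List[str]] = []


def propose(failures: List[str]) -> Candidate:
    seen_feedback.append(list(failures))
    return next(attempts)


tests: List[Test] = [
    ("keeps positives", lambda f: f(3) == 3),
    ("flips negatives", lambda f: f(-3) == 3),
]

fixed = repair_loop(propose, tests)
print(seen_feedback)
```

Without the `tests` list the loop has nothing to iterate on, which is the same position a human is in on an untested system.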

baq•17m ago

> LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation because these are pattern‑extension problems. But pattern‑matching is not system understanding, and plausibility is not correctness.

closes tab

Bnjoroge•11m ago

Yup. Just shows me that they are either oblivious about how much better they’ve gotten or simply unaware. Either way, hard to take their opinions seriously

Wowfunhappy•9m ago

Yeah, this is basically the same "LLMs are just next-word predictors", right?

It's obviously true... and yet when the next word is the completion of a chat template, suddenly they can talk to you. I don't know how far that will ultimately go, but "they're fundamentally just X" isn't providing useful information anymore.

r_lee•14m ago

LLM slop article about LLM slop. amazing how this stuff just gets instantly to the front page

jgbuddy•6m ago

Exactly

DanielHB•10m ago

One thing I realized is just how much the harnesses are geared towards _not_ parsing files and taking shortcuts instead. And even then I am very unimpressed by the speed at which these systems output code, and the amount of tokens you consume doing fairly basic stuff is quite high.

My gut feeling is that it will take at least a couple of orders of magnitude of improvement before these LLMs can even parse large systems, much less understand them holistically. And I don't see an order of magnitude improvement coming any time soon; it feels like the last one was GPT 3.5.

EGreg•8m ago

"But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences.
This is causal reasoning, not pattern extension. LLMs predict tokens, not consequences — and that is why the leap from writing code to producing a safe, system‑aware PR‑ready diff is not incremental but a shift into a fundamentally different problem space."

This is well said. We need a new paradigm. I could go into the shortcomings of the current agent-oriented approaches but it would turn into a huge post. If you want to read it, I wrote it up here: http://safebots.ai/agents.html

jgbuddy•7m ago

How to get to HN front page: 1) AI generate an article about why AI sucks 2) Profit

passive•6m ago

I think this does a very good job of describing the real gaps agents are hitting in practical usage, along with a fairly compelling rationale for why those gaps aren't likely to disappear any time soon.

If we're going to stabilize the software industry, we need to have more discussions like this that identify what constraints apply. (We should have had those discussions before pushing AI out this widely, but that wouldn't have gotten anyone rich.)

I actually think that there's a world of software systems agents can change, but it's materially different from the one we have now, and has a different set of constraints that we've also mostly done a poor job identifying. So hopefully the discussion can help those of us on both sides. ;)

danielpardo•3m ago

I used Claude Code to migrate from Electron to Node + React across ~6k LOC. It handled the mechanical parts well but anything that has to do with creativity or field of interest required human judgment.

AI has no judgement or critical thinking, even if it seems to, so we have to be wary of letting AI do this part, because the result will be poor quality and not innovative.
