> AI is best suited for augmentation not automation.
I agree with this sentiment, but with the caveat of "when it's not lying to you". The most frustrating part of these interactive AI assistants is when one sends me down a rabbit hole of an API that doesn't exist (but looks almost right).
This is not remotely true. Think of any business process around your company. 99.9% availability would mean only 1min26 per day allowed for instability/errors/downtime. Surely your human collaborators aren't hitting this SLA. A single coffee break immediately breaks this (per collaborator!).
Business Process Automation via AI doesn't need to be perfect. It simply needs to be sufficiently better than the status quo to pay for itself.
Reliability means 99.9% of the time when I hand something off to someone else it's what they want.
Availability means I'm at my desk and not at the coffee machine.
Humans very much are 99.9% accurate, and my deliverable even comes with a list of things I'm not confident about
This is an extraordinary claim, which would require extraordinary evidence to prove. Meanwhile, anyone who spends a few hours with colleagues in a predominantly typing/data entry/data manipulation service (accounting, invoicing, presales, etc.) KNOWS the rate of minor errors is humongous.
99.99% is just absurd.
The biggest variable with all this, though, is that agents don't have to one-shot everything like a human, because no one is going to pay a human to do the work 5 times over to make sure the results are the same each time. At some point it will be trivial for agents to keep checking the work and looking for errors in the process 24/7.
The claim is pretty clearly 'can' achieve (humans) vs 'do' achieve (LLM). Therefore one example of a human building a system at 99.9% reliability is sufficient to support the claim. That we can compute and prove reliability is really the point.
For example, the function "return 3" 100% reliably counts the Rs in strawberry. We can see the answer never changes; if it is correct once, it will always be correct, because it always gives the same correct answer. An LLM can't do that, and infamously gave inaccurate results to that problem, not even reaching 80% accuracy.
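To make that concrete, a minimal sketch (the function names are just illustrative):

```python
def count_rs_in_strawberry() -> int:
    # Deterministic: the same (correct) answer every single time.
    return 3

def count_rs(word: str) -> int:
    # A general version whose correctness we can verify by inspection.
    return word.lower().count("r")

assert count_rs_in_strawberry() == count_rs("strawberry") == 3
```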
For the sake of discussion, I'll define reliability to be the product of availability and accuracy and will assume accuracy (the right answer) and availability (able to get any answer) to be independent variables. In my example I held availability at a fixed 100% to illustrate why being able to achieve high accuracy is required for high reliability.
So, two points: humans can achieve 100% accuracy in the systems they build, because we can prove correctness and do error checking. Because an LLM cannot do 100%, there will always be some problem that exposes the difference in maximum capability. While difficult, humans can build highly reliable complex systems. The computer itself is an example: that all the hardware interfaces together so well and works so often is remarkable.
Second, compounding matters: even at 99% reliability per step, a 20-step pipeline only completes about 82% of the time, and at 95% per step it drops to roughly 36%, i.e. it usually fails. For a long pipeline to stay reliable, some steps really need to be effectively at 100%.
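A quick check of that compounding arithmetic:

```python
# End-to-end success rate of a pipeline of independent steps.
for per_step in (0.999, 0.99, 0.95):
    print(f"{per_step:.3f} per step over 20 steps -> {per_step ** 20:.2f}")
# 0.999 -> 0.98, 0.99 -> 0.82, 0.95 -> 0.36
```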
(Your point remains largely the same, just more precise with the updated definition replacing 'reliable' with 'accurate'.)
I used to work on a backup application, it ran locally on our clients' machines. We had over 10000 clients. A 99.9% reliability would mean that there are 10 of our customers, at any one point, having a problem. It's not a question of uptime. It's a question of data integrity in this case. So 99.9% reliability could even leave us open to, potentially, 10 lawsuits. Also, about 10 support calls per day.
Now we only had about 10k customers at the time. Imagine if it were millions.
I'd imagine future agents will include training to design these checks into any output, validating against the checks before proceeding further. They may even include some minor risk assessment beforehand, such as "this aspect is crucial and needs to be 99% correct before proceeding further".
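Something like this, roughly (a sketch only; the names and retry logic are assumptions, not any particular agent framework):

```python
from typing import Callable

def run_step_with_checks(
    produce: Callable[[], str],            # the agent's attempt at this step
    checks: list[Callable[[str], bool]],   # checks designed up front for this output
    max_retries: int = 3,
) -> str:
    for _ in range(max_retries):
        output = produce()
        if all(check(output) for check in checks):
            return output                  # only proceed once the output passes its own checks
    raise RuntimeError("step failed validation; stop and escalate instead of proceeding")
```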
On a personal note, I'm happy to hear that. I've been apprehensive and haven't tried it, purely due to my fear of the cost.
So. It’s a potential superpower for personal projects, yet I don’t see it being very useful in a corporate setting.
I used Claude Code to make this little thing: https://github.com/Baughn/ScriptView
…took me thirty minutes. It wouldn’t have existed otherwise.
The problems begin when integrating hundreds of units prompted by different people, or when writing for work that is both prolific and secret. The context is too limited even with RAG; one would need to train a model filled with the secret information.
So basically non-commercial use is the killer app thus far for Claude Code. I am sure there are some business people who are not happy about this.
You can also use API tokens, yes, but that’s 5-10x more expensive. So I wouldn’t.
https://techcrunch.com/2025/07/17/anthropic-tightens-usage-l...
100% agree as someone that uses API tokens. I use it via API tokens only because my work gave me some Anthropic keys and the directive "burn the tokens!" (they want to see us using it and don't give a crap about costs).
The size of the code base you are working in also matters. On an old, large code base, the cost does go up, though still not very high. On a new or relatively small code base, it is not unusual for my requests to cost a tenth of a cent. For what I am doing, paying with an API key is much cheaper than a subscription would be.
On the other hand, if LLMs are doing the actual service development, that's something software engineers could be doing :)
Agents have captivated the minds of groups of people in each large engineering org. I have no idea what their goal is other than that they work on "GenAI". For over a year now they have been working on agents with the promise that the next framework that MSFT or Alphabet publishes will solve their woes. They don't actually know what they are solving for, except that everything involves agents.
I have yet to see agents solve anything, but for some reason the idea persists that an agent you can send anything and everything to will solve all of a company's problems. LLMs have a ton of interesting applications, but agents have yet to grab me as interesting, and I also don't understand why so many large companies have focused so much time on them. They are not going to crack the code ahead of a commercial tool or open source project. In the time spent toying around with agents, a lot of interesting applications could have been built, some of which might technically be agents, just without so much focus and effort on trying to solve for all use cases.
Edit: after rereading my post wanted to clarify that I do think there is a place for tool call chains and the like but so many folks I have talked to first hand are trying to create something that works for everything and anything.
That said, I have been using LLMs for a while now with great benefit. I did not notice anything missing, and I am not sure what agents bring to the table. Do you know?
I updated a Svelte component at work, and while I could test it in the browser and see it worked fine, the existing unit test suddenly started failing. I spent about an hour trying to figure out why the results logged in the test didn't match the results in the browser.
I got frustrated, gave in and asked Claude Code, an AI agent. The tool call loop is something like: it reads my code, then looks up the documentation, then proposes a change to the test which I approve, then it re-runs the test, feeds the output back into the AI, re-checks the documentation, and then proposes another change.
It's all quite impressive, or it would be if at one point it didn't randomly say "we fixed it! The first element is now active" -- except it wasn't, Claude thought the first element was element [1], when of course the first element in an array is [0]. The test hadn't even actually passed.
An hour and a few thousand Claude tokens my company paid for and got nothing back for lol.
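For what it's worth, that loop is roughly the following (a sketch with illustrative callables, not Claude Code's actual interface):

```python
from typing import Callable, Optional

def tool_call_loop(
    task: str,
    propose_change: Callable[[list[str]], str],   # LLM: read the transcript, propose the next edit
    apply_and_test: Callable[[str], str],         # apply the edit, re-run the test, capture the output
    looks_passed: Callable[[str], bool],          # the weak link: the model judging "did it pass?"
    max_iterations: int = 10,
) -> Optional[str]:
    transcript = [task]
    for _ in range(max_iterations):
        change = propose_change(transcript)
        output = apply_and_test(change)
        transcript.append(output)                 # feed the result back in for the next round
        if looks_passed(output):                  # if this judgment is wrong, the loop "succeeds" on a failure
            return change
    return None
```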
Even in this example the coding agent is short-lived. I am curious about continuously running agents that are never done.
An example of my own, not agentic or running in a loop, but might be an interesting example of a use case for this stuff: I had a CSV file of old coupon codes I needed to process. Everything would start in limbo, uncategorized. Then I wanted to be able to search for some common substrings and delete them, search for other common substrings and keep them. I described what I wanted to do with Claude 3.7 and it built out a ruby script that gave me an interactive menu of commands like search to select/show all/delete selected/keep selected. It was an awesome little throwaway script that would’ve taken me embarrassingly long to write, or I could’ve done it all by hand in Excel or at the command line with grep and stuff, but I think it would’ve taken longer.
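The original was a Ruby script, but the menu it produced was roughly something like this Python sketch (command names are guesses, only illustrative):

```python
import csv

def coupon_triage(path: str) -> None:
    with open(path, newline="") as f:
        codes = [row[0] for row in csv.reader(f) if row]
    status = {code: "uncategorized" for code in codes}   # everything starts in limbo
    selected: list[str] = []

    while True:
        cmd, _, arg = input("> ").partition(" ")
        if cmd == "search":                               # select uncategorized codes containing a substring
            selected = [c for c in codes if arg in c and status[c] == "uncategorized"]
            print(f"{len(selected)} codes selected")
        elif cmd == "show":
            for code in codes:
                print(code, status[code])
        elif cmd == "delete":
            for c in selected:
                status[c] = "deleted"
        elif cmd == "keep":
            for c in selected:
                status[c] = "kept"
        elif cmd == "quit":
            break
```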
Honestly one of the hard things about using AI for me is remembering to try to use it, or coming up with interesting things to try. Building up that new pattern recognition.
I do think it’s a step up when done correctly, thinking of tools like Cursor. Most of my concern comes from the number of folks I have seen trying to create a system that solves everything. I know people in my org were working on agents without even having a problem they were solving for. They are effectively trying to recreate ChatGPT, which to me is a fool’s errand.
What do agents provide? Asynchronous work output, decoupled from human time.
That’s super valuable in a lot of use cases! Especially because it’s a prerequisite for parallelizing “AI” use (1 human : many AI).
But the key insight from TFA (which I 100% agree with) is that the tyranny of sub-100% reliability compounded across multiple independent steps is brutal.
Practical agent folks should be engineering for risk and reliability instead of the happy path.
And there are patterns and approaches to do that (bounded inputs, pre-classification into workable / not-workable, human in the loop), but many teams aren’t looking at the right problem (risk/reliability) and therefore aren’t architecting to those methods.
And there’s fundamentally no way to compose 2 sequential 99% reliable steps into a 99% reliable system with a risk-naive approach (0.99 × 0.99 ≈ 0.98).
Agents, besides tool use, also have memory, can plan work towards a goal, and can, through an iterative process (Reflect - Act), validate if they are on the right track.
https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitati...
I'll manage my whiney emotions over the term Agents, but you'll have to hold a gun to my head before I embrace "Agentic", which is a thoroughly stupid word. "Scripted workflow" is what it is, but I know there are some true "visionaries" out there ready to call it "Sentient workflow".
What I am doing is definitely manual, it is the old-fashioned prompt-copy-paste-test-repeat cycle, but it has been educational.
I think it is a mix of FOMO and the 'upside' potential of being able to minimize (ideally remove) the expensive "human component". Note, I am merely trying to portray a specific world model.
<< In the time spent toying around with agents, a lot of interesting applications could have been built, some of which might technically be agents, just without so much focus and effort on trying to solve for all use cases.
Preaching to the choir, man. We just got a custom AI tool (which manages to have all my industry-specific restrictions, rendering it kinda pointless; low context, making it annoying; and it is slower than normal, because it now has to go through several layers of approval, including 'bias').
At the same time, the committee bickers over a minute change to a process that has effectively no impact on anything of value.
Bonkers.
IOW, it's a case of C-suite "monkey see, monkey do" kicked off by management consultants with crap to sell for very high prices...
For me the only problem I have is I find typing slow and laborious. I've always said if I could find a way to type less I would take it. That's why I've been using tab completion and refactoring tools etc for years now. So I'm kind of excited about being able to get my thoughts into the computer more quickly.
But having it think for me? That's not a problem I have. Reading and assimilating information? Again, not a problem I have. Too much of this is about trying to apply a solution where there is no problem.
I keep hearing vague stuff exactly like your comment at work from management. It's so infuriating.
"AI is not good for what I do, therefore AI is useless"
Edit: not even going to reply to comments below, as they continue down a singular path of "oh, you ought to know what they are trying to do". The only point I was making is that orgs are going solution-first without a real problem they are trying to solve, and I don't think that is the right approach.
I've never understood the "do X to increase/decrease Y by Z%". I remember working at McDonalds and the managers worked themselves up into a frenzy to increase "sale of McSlurry by 10%". All it meant was that they nagged people more and sold less of something else. It's not like people's stomachs got 10% larger.
That is not to say you should work against your company, but bear in mind this is a goal and you should consider where you can add value outside of general code factory productivity and how for example you can become a force multiplier for the company.
Agreed with your annoyance at "they are replacing you" comments. Like, duh. That's what they've been doing forever.
Occasionally it works and people stumble across a problem worth solving as they go about applying their solution to everything. But that's not planning or top-down direction. That's not identifying a target in advance.
The fundamental difference is that we need HITL (human-in-the-loop) to reduce errors, instead of HOTL, which leads to the errors you mentioned.
It seems the author never used prompt/workflow optimization techniques.
LLM-AutoDiff: Auto-Differentiate Any LLM Workflow https://arxiv.org/pdf/2501.16673
Also, if you look at any human process you will realize that none of them have a 100% reliability rate. Yet, even without that we can manufacture e.g. a plane, something which takes millions of steps, each without a 100% success rate.
I actually think the article makes some good points, but especially when you are making good points it is unnecessary to stretch credibility by exaggerating your arguments.
My point was that something extremely complex, like a plane, works, because the system tries hard to prevent compounding errors.
You can do maintenance, inspections, and replacement because of those specifications.
In software the equivalent of blueprints is code. The room for variation outside software “specifications” is infinite.
Human reliability when it comes to assembling planes is also much higher than 99%, and LLM reliability at creating code is much, much lower than 99%.
So until these techniques are baked into the model by OpenAI, you have to come up with these ideas yourself.
If 50% of training data is not factually accurate, this needs to be weeded out.
Some industries require a first principles approach, and there are optimal process flows that lead to accurate and predictable results. These need research and implementation by man and machine.
It's hard to make *one* good product (see startup failure rates). You couldn't make 12 (as seemingly a solo dev?) and you're surprised?
We've been working on Definite[0] for 2 years with a small team, and it only started getting really good in the past 6 months.
0 - data stack + AI agent: https://www.definite.app/
Something seems off about that...
If most of these are one-shot deterministic workflows (as opposed to the input-LLM-tool loop usually meant by the current use of the term "AI agent"), it's not hard to assume you can build, test and deploy one in a month on average.
> agents that technically make successful API calls but can't actually accomplish complex workflows because they don't understand what happened.
It takes a long time to get these things right.
Just because we'd love to have fully intelligent, automatic agents doesn't mean the tech is here. I don't work on anything that generates content (text, images, code). It's just slop and will bite you in the ass in the long run anyhow.
But it generates mistakes, say 1 in 10 times, and I do not see that getting fixed unless we drastically change the LLM architecture. In the future I am sure we will have much more robust systems, if the current hype cycle doesn't ruin its trust with devs.
But the hit is real. I mean, I would hire a lot less if I were hiring now, as I can clearly see the dev productivity boost. The learning curve for most topics is also drastically reduced, as the loss in Google search result quality is now supplemented by LLMs.
But the thing I can vouch for is automation and more streamlined workflows. I mean having normal human tasks augmented by an LLM in a workflow orchestration framework. The LLM can return a confidence % along with the task results, and for anything less than the ideal confidence % the workflow framework can fall back on a human. But if done correctly, with proper testing, guardrails and all, I can see LLMs replacing human agents in several non-critical tasks within such workflows.
The point is not replacing humans but automating most of the work, so team sizes would shrink. For example, large e-commerce firms have hundreds of employees manually verifying product descriptions, images, etc., scanning for anything from typos to image mismatches, to name a few. I can see LLMs doing that job in the future.
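A minimal sketch of that confidence-gated fallback (the threshold and names are assumptions, and the LLM's self-reported confidence would need calibrating against evaluation data):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    output: str
    confidence: float   # 0.0-1.0, reported by the LLM alongside its answer

CONFIDENCE_THRESHOLD = 0.9   # assumed value, tuned per task

def route(result: TaskResult) -> str:
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return "auto-accept"          # flows straight through the workflow
    return "escalate-to-human"        # anything less confident falls back to a person

print(route(TaskResult("description matches images", 0.97)))    # auto-accept
print(route(TaskResult("possible title/image mismatch", 0.55)))  # escalate-to-human
```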
The truth is that we stop thinking when we code like that.
If done right, we all code through a spec written in English, not code.
Not everyone has that opinion. I am not talking about non-programmers jumping on the bandwagon, but real technologists using it in real-world programming [1].
> our brain goes into "watching TV mode"
How many people would have thought the same when the calculator came along? We can either think "oh, this tool is making kids dumb" or we can think the new tool can make them faster and more efficient.
I am talking from my point of view. I am quite experienced and know what I'm doing in a lot of areas, having 18 years of experience. I am faster without agents, produce better code, know the code, and can guarantee maintainability and fewer bugs. Why on earth would I change that? That it's going to get better in the future is a hypothetical which has yet to be proven.
It is not a calculator though, so stop with that nonsensical comparison already. You know exactly what I'm talking about and you're not arguing against my point. That's why I'm saying your comment is tangential. LLMs are not calculators. My point is that LLMs make us neither faster nor more efficient. You trade away quality and maintainability (slop) for speed. That's a different trade-off and makes the two outcomes incomparable.
I wrote a little about one such task, getting agents to supplement my markdown dev-log here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
But still not once have I seen an actual agent in the wild doing concrete work.
A “No True Agent” problem if you will.
ChatGPT's Deep Research mode is also an agent: it will keep crawling the web and refining things until it feels it has enough material to write a good response.
I've gone from skeptical, to willing to humor it, to "yeah this is probably right" in about 5 months. Basically I believe: if you scope the subject matter very, very well, and then focus on the tooling that the model will require to do its task, you get a high completion rate. There is a reluctance to lean into the non-deterministic nature of the models, but actually if you provide really excellent tooling and scope super narrowly, it's generally acceptably good.
This blog post really makes the tooling part seem hard, and, well... it is, but not that hard - we'll see where this all goes, but I remain optimistic.
Same as it's always been.
For agents, that triangle is not very well quantified at the moment, which makes all these investigations interesting but still risky.
This cost scaling will be an issue for this whole AI employee thing, especially because I imagine these providers are heavily discounting.
_Knowing how way leads to way_, the larger the task, the more chance there is for an early deviation to doom the viability of the solution in total. Thus for even the SOTA right now, agents that can work in parallel to generate several different solutions can reduce your time of manually refactoring the generation. I wrote a little about that process here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
Yes, very long workflows with no checks in between will have high error rates. This is true of human workflows too (which also have <100% accuracy at each step). Workflows rarely have this many steps in practice and you can add review points to combat the problem (as evidenced by the author building 12 of these things and not running into this problem)
Sounds like good business to me.
All you are really saying with this comment is you have an incredibly narrow set of interests and absolutely no intellectual curiosity.
And I have an AI workflow that generates much better posts than this.
I read and generate hundreds of posts every month. I have to read books on writing to keep myself sane and not sound like an AI.
Or it's possible that he is one of those people that _really_ adopted LLMs into _all_ their workflow, I guess, and he thinks the output is good enough as is, because it captured his general points?
LLMs have certainly damaged trust in general internet reading now, that's for sure.
Judging by the other comments this is clearly low-effort AI slop.
> LLMs have certainly damaged trust in general internet reading now, that's for sure.
I hate that this is what we have to deal with now.
One reason why LLM generated text bothers me is because there's no conscious, coherent mind behind it. There's no communicative intent because language models are inherently incapable of it. When I read a blog post, I subconsciously create a mental model of the author, deduce what kind of common ground we might have and use this understanding to interpret the text. When I learn that an LLM generated a text I've read, that mental model shatters and I feel like I was lied to. It was just a machine pretending to be a human, and my time and attention could've been used to read something written by a living being.
I read blogs to learn about the thoughts of other humans. If I wanted to know what an LLM thought about the state of vibe coding, I could just ask one at any time.
Perhaps more interesting is whether their argument is valid and whether their math is correct.
"We can't allow this post to create FUD about the current hype on AI agents and we need the scam to continue as long as possible".
I’m at this stage where I’m fine with AI-generated content. Sure, the verbosity sucks, but there’s an interesting idea here. Just make it clear that you’ve used AI, and show your prompts.
My thinking: In a financial system collapse (a la The Big Short), the assets under analysis are themselves the things of value. Whereas betting on AI to collapse a technology business is at least one step removed from actual valuation, even assuming:
1. AI Agents do deliver just enough, and stay around long enough, for big corporations to lay off large numbers of employees
2. After doing so, AI quickly becomes prohibitively expensive for the business
3. The combination of the above factors tanks business productivity
In the event of a perfect black swan, the trouble is that it's not actually clear that this combination of factors would result in concrete valuation drops. The business just "doesn't ship as much" or "ships more slowly". This is bad, but it's only really bad if you have competitors that can genuinely capitalise on that stall.
An example immediately on-hand: for non-AI reasons, the latest rumors are that Apple's next round of Macbook Pros will be delayed. This sucks. But isn't particularly damaging to the company's stock price because there isn't really a competitor in the market that can capitalise on that delay in a meaningful way.
Similarly, I couldn't really tell you what the most recent non-AI software features shipped by Netflix or Facebook or X actually were. How would I know if they're struggling internally and have stopped shipping features because AI is too expensive and all their devs were laid off?
I guess if you're looking for a severe black swan to bet against AI Agents in general, you'd need to find a company that was so entrenched and so completely committed to and dependent on AI that they could not financially survive a shock like that AND they're in a space where competitors will immediately seize advantage.
Don't get me wrong though, even if there's no opportunity to actually bet against that situation, it will still suck for literally everyone if it eventuates.
I don't think this one is worth shorting because there's no specific event to trigger the mindshare to start moving and validating your position. You'd have to wait for very big public failures before the herd start to move.
Claude Code is impressive but it still produces quite a bit of garbage in my experience, and coding agents are likely to be the best agents around for the foreseeable future.
This is not a new observation -- Clark's note on overestimating short term and underestimating long term impact of technology is one of my favorite patterns. My 2c.
This phrase is usually followed by some, you know...Math?
> AI tools aren't perfect yet. They sometimes make mistakes, and they can't always understand what you are trying to do. But they're getting better all the time, In the future, they will be more powerful and helpful. They'll be able to understand your code even better, and they'll be able to generate even more creative ideas.
From another post on the same site. [0]
Yup, slop.
[0]: https://utkarshkanwat.com/writing/review-of-coding-tools/
Humans can try things, learn, and iterate. LLMs still can't really do the second thing, you can feed back an error message into the prompt but the learning isn't being added to its weights so its knowledge doesn't compound with experience like it does for us.
I think there are still a few theoretical breakthroughs needed for LLMs to achieve AGI and one of them is "active learning" like this.
At the risk of making a terrible analogy, right now we're able to "give birth" to these machines after months of training, but once they're born, they can't really learn. Whereas animals learn something new every day, go to sleep, clean up their memories a bit (deleting some, solidifying others), and wake up with an improved understanding of the world.
Right now we train AI babies, dump them in the wild... and expect them to have all the answers.
More specific to agents, humans can also figure out how to use tools on the fly (even in the absence of documentation) where LLMs need human-built MCPs. This is also a significant limiting factor.
That’s not to say they’re not useful in their current state. They are. However, I believe it’s becoming clear that there’s a hard ceiling to how capable LLMs in their current form can become and it’s going to take something radically different to break through.
I'm not sure you can do that. As humans, we need to make things up in order to have theories to test. Like back in the day before Einstein when people thought that light traveled through an "aether" whose properties we needed to figure out how to measure, or today when we can't explain the mass imbalance of the universe so we create this concept called "dark matter."
Also, in my experience the problem has been getting worse, or at least not better. I asked Claude 3.7 some time ago how to restore a snapshot to an active database on AWS, and it cheerfully told me to go to the console and press the button. Except there is no button, because AWS docs specifically say you can't restore a snapshot to an active database.
Even for software applications like the Linux kernel, there would have been a theory in Linus' head - for example of what an operating system is, and how it should work.
A theory gives 100% correct predictions, although the theory itself may not model the world accurately. Such feedback between the theory and its application in the world causes iterations to the theory: from Newtonian mechanics to relativity, from Euclidean geometry to the geometry of curved spaces, etc.
Long story short, the LLM is a long way away from any of this. And to be fair to LLMs, the average human is not creating theories; it takes some genius to create them (Newton, Turing, etc.). The average human is trading memes on social media.
Someone was saying that with an increasing number of attempts, or increasing context length, LLMs are less and less likely to solve a problem
(I searched for it but can't find it)
That matches my experience -- the corrections in long context can just as easily be anti-corrections, e.g. turning something that works into something that doesn't work
---
Actually it might have been this one, but there are probably multiple sources saying the same thing, because it's true:
Context Rot: How Increasing Input Tokens Impacts LLM Performance - https://news.ycombinator.com/item?id=44564248
In this report, we evaluate 18 LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. Our results reveal that models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows.
---
As far as this question goes: how do we manage to create things like the Linux kernel or the Mars landers without AI?
It's because human intelligence is a totally different thing than LLMs (contrary to what interested people will tell you)
Carmack said there are at least 5 or 6 big breakthroughs left before "AGI", and I think even that is a misleading framing. It's certainly possible that "AGI" will not be reached - there could be hardware bottlenecks, software/algorithmic questions, or other obstacles we haven't thought of
That is, I would not expect AI to create anything like the Linux kernel. The burden of proof is on the people who claim that, not the other way around !!!
Speaking of Apple, I just want to get it out there that I'm impressed that they're exhibiting self restraint in this AI era. I know they get bashed for not being "up to speed" with "the rest of the industry," but I believe they're doing this on purpose because they see what slop it is and they'd prefer to scope it down to release something more useful.
Hey, maybe humans aren't just like LLMs after all.
(End quote)
Isn't this just wrong? Isn't the author conflating the accuracy of LLM output at each step with the accuracy of the final artifact, which is a reproducible, deterministic piece of code?
And they're completely missing that a person in the middle is going to intervene at some point to test it, and at that point the output artifact's accuracy either goes to 100% or the person running the agent backtracks.
Either I am missing something or this does not seem well thought through.
In fact, the point of the whole article isn't that AI doesn't work; to the contrary, it's that long chains of (20+) actions with no human intervention (which many agentic companies promise) don't work.
And you mention testing, which certainly can be done. But when you have a large product and the code generator is unreliable (which LLMs always are), then you have to spend most of your time testing.
Yes, I get that the context window increases over time and that for many purposes it's already sufficient, but the current paradigm forces you to compress your personal context into a prompt to produce a meaningful result. In a language as malleable as English, this doesn't feel like engineering so much as it feels like incantations and guessing. We're losing so, so much by skipping determinism.
For better or worse, everything we see and do ends up modifying our "weights", which is something current LLMs just architecturally can't do since the weights are read-only.
All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.
I've found LLMs to be quite useful for language stuff like "rename this service across my whole Kubernetes cluster". But when it comes to specific things like "sort this API endpoint alphabetically", I find the amount of time needed to learn to construct an appropriate prompt is the same as if I'd just learnt to program, which I already have done. And then there's the energy used by the LLM to do its thing, which is enormously wasteful.
This right here is the nail on the head. When you use (a) language to ask a computer to return you a response, there's a word for that and it's "programming". You're programming the computer to return data. This is just programming at a higher level, but we've always been increasing the level at which we program. This is just a continuation of that. These systems are not magical, nor will they ever be.
Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.
Can you provide some examples of problems where humans have such large context windows?
And all this can be novelly combined and reasoned with to come up with new stuff to put into the "context window", and it can be dynamically extended at any point (e.g. you recall something similar during a thought train and "bring it into context").
And all this was only the current task-specific window, which lives inside the sum total of your human experience window.
Human context windows are not linear. They have "holes" in them which are quickly filled with extrapolation that is frequently correct.
It's why you can give a human an entire novel, say "Christine" by Stephen King, then ask them questions about some other novel until their "context window" is filled, then switch to questions about "Christine" and they'll "remember" that they read the book (even if they get some of the details wrong).
> Can you provide some examples of problems where humans have such large context windows?
See above.
The reason is because humans don't just have a "context window", they have a working memory that is also their primary source of information.
IOW, if we change LLMs so that each query modifies the weights (i.e. each query is also another training data-point), then you wouldn't need a context window.
With humans, each new problem effectively retrains the weights to incorporate the new information. With current LLMs the architecture does not allow this.
This misses a key feature of agents though. They get feedback from linters, build logs, test runs and even screenshots. And they collect this feedback themselves. This means they can error correct some mistakes along the way.
The math works out differently depending on how well it can collect automated feedback on whether it is doing what you want.
For context, relevant information from earlier steps can be cherry-picked for the next stage.
The math works differently because AI (mostly) ignores irrelevant results. So steps actually increase reliability overall.
Second, since LLMs are non-deterministic in nature, how do you know if the quality went from 90% to 30%? There is no test you can write. What if the model provider degrades quality? You have no test for that either.
Whether or not one term in this equation currently compounds faster is a good question, as is under what circumstances, etc., but presenting agentic abilities as always-flawed thinking that makes long-term task execution impossible isn't right. Humans are flawed and require long, drawn-out, multi-task thinking to get correct answers, and interacting with and getting feedback from the world outside the mind during task execution typically raises the chance of the correct answer being spit out in the end.
I'd agree that the agentic math isn't great at the moment, but if it's possible to reduce hallucinations or raise the strength and frequency of real-world feedback's effect on the model, you could see this playing out differently, perhaps quite soon. There are at least a couple of examples of "we're already there".
1. Multi-turn agents can correct themselves with more steps, so the reductive error cascade thinking here is more wrong than right in my experience
2. The 99.9% production requirement is so contextual and misleading, when the real comparison is often something like "outage", "dead air", "active incident", "nobody on it", "prework before/around human work", "proactive task no one had time for before", etc.
Similar to infra as code, CI, and many other automation processes, there are mountains of work that isn't being done, and LLMs can do it entirely or in large swathes.
E.g.: you need good verifiers (to understand whether a task is done successfully or not). Many tasks are easier to verify than to do. If you have five parallel generations with 80% accuracy, the probability of getting at least one right (with a verifier that can pick it out) goes to about 99.97%. With multi-step workflows too, the math changes in a similar manner. It just needs a different approach from how we have built software to date. He even hints at a paradigm of 3-5 discrete step workflows which works superbly well. We need to build more in that way.
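Checking that arithmetic, assuming independent attempts and a verifier that always recognizes a correct result:

```python
p_correct = 0.8     # accuracy of a single generation
n_attempts = 5      # parallel generations
p_at_least_one = 1 - (1 - p_correct) ** n_attempts
print(f"{p_at_least_one:.4%}")   # 99.9680%
```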
In the software world (like the article is talking about) this is the logic that has ruthlessly cut software QA teams over the years. I think quality has declined as a result.
Verifiers are hard because the possible states of the internal system + of the external world multiply rapidly as you start going up the component chain towards external-facing interfaces.
That coordination is the sort of thing that really looks appealing for LLMs - do all the tedious stuff to mock a dependency, or pre-fill a database, etc - but they have an unfortunate tendency to need to be 100% correct in order for the verification test that depends on them to be worth anything. So you can go further down the rabbit hole, and build verifiers for each of those pre-conditions. This might recurse a few times. Now you end up with the math working against you - if you need 20 things to all be 100%, then even high chances of each individual one starts to degrade cumulatively.
A human generally wouldn't bother with perfect verification of every case, it's too expensive. A human would make some judgement calls of which specific things to test in which ways based on their intimate knowledge of the code. White box testing is far more common than black box testing. Test a bunch of specific internals instead of 100% permutations of every external interface + every possible state of the world.
But if you let enough of the code that solves the task be LLM-generated, you stop being in a position to do white-box testing unless you take the time to internalize all the code the machine wrote for you. Now your time savings have shrunk dramatically. And in the current state of the world, I find myself having to correct it more often than not, further reducing my confidence and taking up more time. In some places you can try to work around this by adjusting your interfaces to match what the LLM predicts, but this isn't universal.
---
In the non-software world the situation is even more dire. Often verification is impossible without doing the task. Consider "generate a report on the five most promising gaming startups" - there's no canonical source to reference. Yet these are things people are starting to blindly hand off to machines. If you're an investor doing that to pick companies, you won't even find out if you're wrong until it's too late.
For the non-software world, people use majority voting most of the time.
The failure rate is high because you view it in series. At test time you need to know which of the options is correct (including none of them); you don't need to know why it failed. You can debug later. The challenge is how easily you can return to the right track.
Something I knew all along was that you build the system that lets you do it with the human in the loop, collect evaluation and training data [1] and then build a system which can do some of the work and possibly improve the quality of the rest of it.
[1] in that order because for any 'subjective' task you will need to evaluate the symbolic system even if you don't need to train it -- if you need to train the system, on the other hand, you'll still need to eval
If the LLM can’t answer a query it usually forwards the chat to a human support agent.
This is not part of a defined workflow that requires structured output.
That isn't reliable either, but it supports the person who gets the mail on his desk in the end.
We sometimes get handwritten service protocols, and the model we are using is very proficient at reading handwritten notes which you would have difficulty parsing yourself.
It works most of the time, but not often enough that AI could give autogenerated answers. For service quality reasons we don't want to impose any chatbot or AI on a customer.
Also data protection issues arise if you use most AI services today, so parsing customer contact info is a problem as well. We also rely on service partners to tell the truth about not using any data...
Agents are digital manufacturing machines and benefit from the same processes we identified for reliability in the real world
literally everybody talks about this lmao what are you on about https://www.youtube.com/watch?v=d5EltXhbcfA
Perhaps that's why MCP as a protocol is so interesting to people - MCP servers are a chance at a 'blank slate' in front of the enterprise system. You pull out only the parts you're interested in, you get to define clear boundaries when you build the MCP server, the LLM sees only what you want it to see and you hide the messiness of the enterprise system.
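A hedged illustration of that "blank slate" idea, independent of the actual MCP SDK (tool names and backends here are made up): the model only ever sees a couple of narrow, documented operations, and the messy system stays hidden behind them.

```python
from typing import Any, Callable

# Each exposed tool: a description the model sees, plus a handler hiding the enterprise system.
TOOLS: dict[str, tuple[str, Callable[..., Any]]] = {
    "lookup_order_status": (
        "Return the status of an order by order ID.",
        lambda order_id: {"order_id": order_id, "status": "shipped"},   # stubbed backend call
    ),
    "list_open_invoices": (
        "List unpaid invoice IDs for a customer.",
        lambda customer_id: {"customer_id": customer_id, "invoices": ["INV-1042"]},
    ),
}

def call_tool(name: str, **kwargs: Any) -> Any:
    if name not in TOOLS:
        raise ValueError(f"tool {name!r} is not exposed to the model")
    _description, handler = TOOLS[name]
    return handler(**kwargs)

print(call_tool("lookup_order_status", order_id="A-123"))
```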
I was under the impression that some kind of caching mechanism existed to mitigate this
They don't go into implementation details but Gemini docs say you get a 75% discount if there's a context-cache hit: https://cloud.google.com/vertex-ai/generative-ai/docs/contex...
LLMs degrade with long input regardless of caching.