
I'm betting against AI agents, despite building them

https://utkarshkanwat.com/writing/betting-against-agents/
221•Dachande663•6h ago

Comments

Retr0id•5h ago
> Each new interaction requires processing ALL previous context

I was under the impression that some kind of caching mechanism existed to mitigate this

_heimdall•4h ago
Caching would only help keep the context around; the model still ultimately has to read and process that cached context again.
Retr0id•4h ago
You can cache the whole inference state, no?

They don't go into implementation details but Gemini docs say you get a 75% discount if there's a context-cache hit: https://cloud.google.com/vertex-ai/generative-ai/docs/contex...

_heimdall•4h ago
That just avoids having to send the full context with follow-up requests, right? My understanding is that caching keeps the context around but can't avoid the need to process that context over and over during inference.
bakugo•1h ago
The initial context processing is also cached, which is why there's a significant discount on the input token cost.
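(For concreteness, a rough sketch of that billing arithmetic. The 75% figure is the cached-token discount from the Gemini docs linked above; the per-token price here is made up.)

```python
# Illustrative only: effective input cost when most of the prompt is a
# cache hit, assuming a 75% discount on cached tokens (per the Gemini
# docs linked above). The price per million tokens is hypothetical.
PRICE_PER_MTOK = 1.25
CACHE_DISCOUNT = 0.75

def input_cost(cached_tokens: int, new_tokens: int) -> float:
    cached = cached_tokens * PRICE_PER_MTOK * (1 - CACHE_DISCOUNT) / 1e6
    fresh = new_tokens * PRICE_PER_MTOK / 1e6
    return cached + fresh

# A 100k-token cached context plus a 500-token follow-up question:
print(f"cached:   ${input_cost(100_000, 500):.4f}")
print(f"uncached: ${input_cost(0, 100_500):.4f}")
```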
csomar•4h ago
My understanding is that caching reduces computation, but the whole input is still processed. I don't think they're fully disclosing how their cache works.

LLMs degrade with long input regardless of caching.

blackbear_•4h ago
You have to compute attention between all pairs of tokens at each step, making the naive implementation O(N^3). This is optimized by caching the previous keys and values (the KV cache), so that at each step you only need to compute attention between the new token and all previous ones. That's much better, but still O(N^2) to generate a sequence of N tokens.
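(A minimal NumPy sketch of the cached case, nothing model-specific: single head, no learned projections. Each step attends one new token against the n cached keys, so per-step work is O(n·d) and generating N tokens totals O(N²·d).)

```python
import numpy as np

d = 64                      # head dimension (illustrative)
k_cache, v_cache = [], []   # the "KV cache": keys/values of all prior tokens

def decode_step(x: np.ndarray) -> np.ndarray:
    """Attention output for one new token embedding x of shape (d,)."""
    k_cache.append(x)       # real models apply learned K/V projections here
    v_cache.append(x)
    K = np.stack(k_cache)   # (n, d): all keys so far
    V = np.stack(v_cache)   # (n, d)
    scores = K @ x / np.sqrt(d)    # new token vs. every cached token: O(n*d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                   # softmax over the n cached positions
    return w @ V

rng = np.random.default_rng(0)
for _ in range(10):
    out = decode_step(rng.standard_normal(d))
```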
stpedgwdgfhgdd•3h ago
Compact the conversation (CC)
ilaksh•3h ago
Yes, prompt caching helps a lot with the cost. It still adds up if you have some tool outputs with long text. I have found that breaking those out into subtasks makes the overall cost much more reasonable.
dmezzetti•5h ago
It's clear that what we currently call AI is best suited for augmentation not automation. There are a lot of productivity gains available if you're willing to accept that.
vntok•4h ago
> Production systems need 99.9%+ reliability

This is not remotely true. Think of any business process around your company. 99.9% availability would mean only 1min26 per day allowed for instability/errors/downtime. Surely your human collaborators aren't hitting this SLA. A single coffee break immediately breaks this (per collaborator!).

Business Process Automation via AI doesn't need to be perfect. It simply needs to be sufficiently better than the status quo to pay for itself.

hansmayer•4h ago
This may not be about internal business processes. In e-commerce, 90 seconds can be a lot of revenue lost, and in mission-critical applications such as telecommunications or air traffic control it would be downright disastrous (ever heard of five-nines availability?).
lexicality•4h ago
Currently I'm thinking about how furious the developers get any time Jenkins has any kind of hiccough, even if the solution is just "re-run the workflow" - and that's just network timeouts! I don't want to imagine the tickets if the CI system started spitting out hallucinations...
Pasorrijer•4h ago
I think you're conflating reliability and availability.

Reliability means 99.9% of the time when I hand something off to someone else it's what they want.

Availability means I'm at my desk and not at the coffee machine.

Humans very much are 99.9% accurate, and my deliverable even comes with a list of things I'm not confident about

vntok•3h ago
> Humans very much are 99.9% accurate

This is an extraordinary claim, which would require extraordinary evidence to prove. Meanwhile, anyone who spends a few hours with colleagues in a predominantly typing/data entry/data manipulation service (accounting, invoicing, presales, etc.) KNOWS the rate of minor errors is humongous.

satyrun•2h ago
Yea exactly.

99.99% is just absurd.

The biggest variable with all of this, though, is that agents don't have to one-shot everything the way a human does, because no one is going to pay a human to do the work five times over to make sure the results come out the same each time. At some point it will be trivial for agents to keep checking the work and looking for errors in the process 24/7.

navane•4h ago
It's not just about up time. If the bridge collapses people die. Some of us aren't selling ads.
vntok•3h ago
If "the bridge collapses and people die" because the team has a 1min26 "downtime" on a specific day, which is what you are arguing, then you have much bigger problems to solve than the performance of AI agents.
lerchmo•3h ago
A lot of deterministic systems externalize their edge cases to the user. The software design doesn't fully match the reality of how it gets used. AI can be far more flexible in the face of dynamic and variable requirements.
KoolKat23•4h ago
Human multi-step workflows tend to have checkpoints where the work is validated before proceeding further, as humans generally aren't 99%+ accurate either.

I'd imagine future agents will include training to design these checks into any output, validating against the checks before proceeding further. They may even include some minor risk assessment beforehand, such as "this aspect is crucial and needs to be 99% correct before proceeding further".
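(A sketch of that checkpoint shape, with illustrative names from no particular framework: each step's output must pass a validation gate, by rule, by another model call, or by a human, before the workflow proceeds, so an error is caught at the gate instead of compounding downstream.)

```python
from typing import Callable

# A step is a (run, validate) pair; validate is the checkpoint.
Step = tuple[Callable[[str], str], Callable[[str], bool]]

def run_with_checkpoints(task: str, steps: list[Step]) -> str:
    state = task
    for i, (run, validate) in enumerate(steps):
        state = run(state)
        if not validate(state):
            # Stop and escalate rather than feeding a bad output forward.
            raise RuntimeError(f"checkpoint {i} failed; escalating to a human")
    return state

# Toy usage: validators can be as strict as the step is crucial.
steps: list[Step] = [
    (str.strip, lambda s: len(s) > 0),
    (str.upper, lambda s: s.isupper()),
]
print(run_with_checkpoints("  draft report  ", steps))
```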

a_bonobo•4h ago
That's what Claude Code does - it constantly stops and asks you whether you want to proceed, including showing you the suggested changes before they're implemented. Helps with avoiding token waste and 'bad' work.
KoolKat23•4h ago
That's good to hear, they're on their way there!

On a personal note, I'm happy to hear that. I've been apprehensive and haven't tried it, purely out of fear of the cost.

queenkjuul•2h ago
My work has a corporate subscription, and on the one hand it's very impressive, and on the other I don't actually find it useful.
Filligree•2h ago
It’s best at small to medium projects written in a consistent style.

So. It’s a potential superpower for personal projects, yet I don’t see it being very useful in a corporate setting.

I used Claude Code to make this little thing: https://github.com/Baughn/ScriptView

…took me thirty minutes. It wouldn’t have existed otherwise.

Filligree•2h ago
The standard way to use Claude Code is with a constant-cost subscription; one of their standard website accounts. It’s rate-limited but still generous.

You can also use API tokens, yes, but that’s 5-10x more expensive. So I wouldn’t.

sarchertech•1h ago
If API tokens are 10x more expensive doesn’t that imply that the constant-cost subscription is massively subsidized?
jampekka•45m ago
Relies on many of the subscribers underusing their quota?
_fat_santa•49m ago
> You can also use API tokens, yes, but that’s 5-10x more expensive. So I wouldn’t.

100% agree as someone that uses API tokens. I use it via API tokens only because my work gave me some Anthropic keys and the directive "burn the tokens!" (they want to see us using it and don't give a crap about costs).

freedomben•32m ago
This is going to depend on what you're doing with it. I use Claude Code for some stuff multiple times a day, and it is unusual for a session to cost me even $0.05. Even the most expensive thing I did ended up costing something like $6, and that was a big and intensive workflow.

The size of the code base you are working in also matters. On an old, large code base the cost does go up, though still not very high. On a new or relatively small code base, it is not unusual for my requests to cost a tenth of a cent. For what I am doing, paying with an API key is much cheaper than a subscription would be.

csomar•4h ago
Lots of applications have to be redesigned around that. My guess is that micro-services architecture will see a renaissance since it plays well with LLMs.
Simon_O_Rourke•4h ago
Don't tell management about this, as they're all betting the house on AI agents next year.
pmg101•4h ago
Only one of these outcomes will be correct, so worth putting money on it if you think they're wrong a la The Big Short.
DavidPiper•3h ago
Not OP, but I've been thinking about this and concluded it's not quite so clear-cut. If I was going to go down this path, I think I would bet on competitors, rather than against incumbents.

My thinking: In a financial system collapse (a la The Big Short), the assets under analysis are themselves the things of value. Whereas betting on AI to collapse a technology business is at least one step removed from actual valuation, even assuming:

1. AI Agents do deliver just enough, and stay around long enough, for big corporations to lay off large number of employees

2. After doing so, AI quickly becomes prohibitively expensive for the business

3. The combination of the above factors tank business productivity

In the event of a perfect black swan, the trouble is that it's not actually clear that this combination of factors would result in concrete valuation drops. The business just "doesn't ship as much" or "ships more slowly". This is bad, but it's only really bad if you have competitors that can genuinely capitalise on that stall.

An example immediately on hand: for non-AI reasons, the latest rumors are that Apple's next round of MacBook Pros will be delayed. This sucks, but it isn't particularly damaging to the company's stock price, because there isn't really a competitor in the market that can capitalise on that delay in a meaningful way.

Similarly, I couldn't really tell you what the most recent non-AI software features shipped by Netflix or Facebook or X actually were. How would I know if they're struggling internally and have stopped shipping features because AI is too expensive and all their devs were laid off?

I guess if you're looking for a severe black swan to bet against AI Agents in general, you'd need to find a company that was so entrenched and so completely committed to and dependent on AI that they could not financially survive a shock like that AND they're in a space where competitors will immediately seize advantage.

Don't get me wrong though, even if there's no opportunity to actually bet against that situation, it will still suck for literally everyone if it eventuates.

conartist6•3h ago
If you want to bet on a competitor, let's talk cause I'm your guy. While everyone else was looking the other way, I stole home: https://github.com/bablr-lang
Quarrelsome•3h ago
Shorting only works if people realise it when you do. The C-suite will run out of makeup before admitting it's a pig, because the payoff is huge for them. I reckon agentic dev can function "just enough" to let them delay reality for a bit while they fire more of their engineering team.

I don't think this one is worth shorting because there's no specific event to trigger the mindshare starting to move and validating your position. You'd have to wait for very big public failures before the herd starts to move.

ptero•3h ago
While true, the world doesn't end in 2025. While I would also agree that big financial benefits from agents to companies appear unlikely to arrive this year (and the title specifically mentions 2025) I would bet on agents becoming a disruptive technology in the next 5-10 years. My 2c.
corentin88•2h ago
Why this timeline? What’s missing today that would make it possible in 5-10 years?
queenkjuul•2h ago
Better models?

Claude Code is impressive but it still produces quite a bit of garbage in my experience, and coding agents are likely to be the best agents around for the foreseeable future.

ptero•1h ago
Just empirical observations. It takes time to propagate technology down to general businesses and business methods up to technology developers. The "propagate down to business methods" is the slower path, as it requires business leaders to become familiar enough with technology to get ideas on how to leverage it.

This is not a new observation -- Clarke's note about overestimating the short-term and underestimating the long-term impact of technology is one of my favorite patterns. My 2c.

chelmzy•29m ago
This is what I try to explain to people who ask "If LLMs are so good why haven't they replaced workers?". Well it takes a long time for the railroads to be built. What use is a locomotive without rails?
exe34•2h ago
Do you have suggestions on how one would go about doing this? Do you just approach a betting company and make some prediction against some wager?
trentnix•24m ago
They're just following the herd.
infecto•4h ago
Link does not work for me but as someone who does a lot of work with LLMs I am also betting against agents.

Agents have captivated the minds of groups of people in every large engineering org. I have no idea what their goal is, other than that they work on "GenAI". For over a year now they have been working on agents with the promise that the next framework MSFT or Alphabet publishes will solve their woes. They don't actually know what they are solving for, except that everything involves agents.

I have yet to see agents solve anything, but for some reason the idea persists that an agent you can send anything and everything to will solve all of a company's problems. LLMs have a ton of interesting applications, but agents have yet to strike me as interesting, and I also don't understand why so many large companies have focused time on them. They are not going to crack the code ahead of a commercial tool or open source project. In the time spent toying around with agents, a lot of interesting applications could have been built, some of which may technically be agents, but without so much focus and effort on trying to solve for all use cases.

Edit: after rereading my post wanted to clarify that I do think there is a place for tool call chains and the like but so many folks I have talked to first hand are trying to create something that works for everything and anything.

JKCalhoun•3h ago
Link is working for me — perhaps it was not 30 minutes ago? (Safari, MacOS)
johnisgood•3h ago
I have no idea what agents are for, could be my own ignorance.

That said, I have been using LLMs for a while now with great benefit. I did not notice anything missing, and I am not sure what agents bring to the table. Do you know?

mhog_hn•3h ago
An agent is an LLM + a tool call loop - it is quite a step up in terms of value in my experience
johnisgood•3h ago
What is the use case? What does it solve exactly, or what practical value does it give you? I am not sure what a tool call loop is.
ghuntley•3h ago
> I am not sure what a tool call loop is.

See https://ampcode.com/how-to-build-an-agent
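(In the spirit of that link, the core loop fits in a few lines. A toy sketch where `call_llm` and the tools are stand-ins, not any real API:)

```python
import json

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a real chat-completion call that may request a tool."""
    raise NotImplementedError

TOOLS = {
    "read_file": lambda args: open(args["path"]).read(),
    "run_tests": lambda args: "...test output...",
}

def agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        reply = call_llm(messages)
        if reply.get("tool") is None:          # no tool requested: done
            return reply["content"]
        result = TOOLS[reply["tool"]](reply["args"])
        # Feed the tool result back in and let the model decide what's next.
        messages.append({"role": "tool", "content": json.dumps(result)})
```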

kro•3h ago
The tools can be an editor/terminal/dev environment, with the agent automatically iterating: testing the changes and refining until there's a finished product, without a human developer. At least, that's what some wish for it.
johnisgood•2h ago
Oh, okay, I understand it now, especially with the other comment that said Cursor is one. OK, makes sense. Seems like it "just" reduces friction (quite a lot).
csande17•2h ago
Yeah, it's really just a user experience improvement. In particular, it makes AI look a lot better if it can internally retry a bunch of times until it comes up with valid code or whatever, instead of you having to see each error and prompt it to fix it. (Also, sometimes they can do fancy sampling tricks to force the AI to produce a syntactically valid result the first time. Mostly this is just used for simple JSON schemas though.)
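(A sketch of that retry pattern, `call_llm` again a stand-in: failed attempts and error messages stay inside the loop, and only a validated result ever escapes to the user.)

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real completion call."""
    raise NotImplementedError

def generate_json(prompt: str, required_keys: set, max_tries: int = 5) -> dict:
    feedback = ""
    for _ in range(max_tries):
        raw = call_llm(prompt + feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            feedback = f"\nYour last output was invalid JSON ({e}). Try again."
            continue
        if isinstance(data, dict) and required_keys <= data.keys():
            return data                  # only success escapes the loop
        feedback = (f"\nOutput must be a JSON object with keys "
                    f"{sorted(required_keys)}. Try again.")
    raise RuntimeError("no valid output after retries")
```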
johnisgood•1h ago
Thank you, that is what my initial thought was. I am still doing things the old-fashioned way, thankfully it has worked out for me (and learned a lot in the process), but perhaps this AI agent thing might speed things up a bit. :D Although then I will learn much less.
infecto•3h ago
Cursor is my classic example. I don’t know exactly what tools are defined in their loop but you give the agent some code to write. It may search your code base, it may then search online for third party library docs. Then come back and write some code etc.
queenkjuul•2h ago
An example:

I updated a Svelte component at work, and while I could test it in the browser and see it worked fine, the existing unit test suddenly started failing. I spent about an hour trying to figure out why the results logged in the test didn't match the results in the browser.

I got frustrated, gave in, and asked Claude Code, an AI agent. The tool call loop is something like: it reads my code, then looks up the documentation, then proposes a change to the test, which I approve; then it re-runs the test, feeds the output back into the AI, re-checks the documentation, and proposes another change.

It's all quite impressive, or it would be if it hadn't at one point randomly said "we fixed it! The first element is now active" -- except it wasn't: Claude thought the first element was element [1], when of course the first element in an array is [0]. The test hadn't even actually passed.

An hour and a few thousand Claude tokens my company paid for and got nothing back for lol.

apwell23•1h ago
Any examples outside of coding agents?

Even in this example the coding agent is short-lived. I am curious about continuously running agents that are never done.

queenkjuul•1h ago
No, the fact Claude couldn't remember that JavaScript is zero-indexed for more than 20 minutes has not left me interested in letting it take on bigger tasks
jsemrau•1h ago
If it were only tool use, then it would be the same as a lambda function.
infecto•3h ago
Not a disagreement with you but wanted to further clarify.

I do think it's a step up when done correctly, thinking of tools like Cursor. Most of my concern comes from the number of folks I have seen trying to create a system that solves everything. I know people in my org were working on agents without even a problem they were solving for. They are effectively trying to recreate ChatGPT, which to me is a fool's errand.

ethbr1•58m ago
I’d boil it down thusly:

What do agents provide? Asynchronous work output, decoupled from human time.

That’s super valuable in a lot of use cases! Especially because it’s a prerequisite for parallelizing “AI” use (1 human : many AI).

But the key insight from TFA (which I 100% agree with) is that the tyranny of sub-100% reliability compounded across multiple independent steps is brutal.

Practical agent folks should be engineering risk / reliability, instead of happy path.

And there are patterns and approaches to do that (bounded inputs, pre-classification into workable / not-workable, human in the loop), but many teams aren’t looking at the right problem (risk/reliability) and therefore aren’t architecting to those methods.

And there’s fundamentally no way to compose 2 sequential 99% reliable steps into a 99% reliable system with a risk-naive approach.
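(The compounding is one line of arithmetic, using the same numbers as the article:)

```python
# Per-step success compounds multiplicatively across sequential steps.
for p in (0.99, 0.95):
    for n in (2, 5, 10, 20):
        print(f"{n:>2} steps @ {p:.0%}/step -> {p**n:6.1%} end-to-end")
# 0.99**2 ~ 98.0% (already below 99%); 0.95**20 ~ 35.8%.
```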

jsemrau•1h ago
Agents are more than that.

Agents, besides tool use, also have memory, can plan work towards a goal, and can, through an iterative process (Reflect - Act), validate if they are on the right track.

ivape•35m ago
If an agent takes Topic A and goes down a rabbit hole all the way to Topic Z, you'll see that it can't incorporate Topic A or backtrack to it without losing a lot of detail from the trek down to Topic Z. It's a serious limitation right now on the application-development side, but I'm just reiterating what the article pointed out: you need to work with fewer-step workflows that aren't as ambitious as covering everything from A to Z.
ivape•44m ago
You are a manual agent for LLMs when you use things like ChatGPT: you go through a workflow loop as you investigate and consult with the LLM. Agents just try to automate that workflow against an LLM. It's basically scripting. Scripting these LLMs is where we all want to go, but the context window length is a limiting factor, as is inference over any notably sized window.

I'll manage my whiney emotions over the term Agents, but you'll have to hold a gun to my head before I embrace "Agentic", which is a thoroughly stupid word. "Scripted workflow" is what it is, but I know there are some true "visionaries" out there ready to call it "Sentient workflow".

A4ET8a8uTh0_v2•2h ago
<< I also don’t understand why so many large companies have focused time around it. They are not going to be cracking the code ahead of a commercial tool or open source project.

I think it is a mix of FOMO and the 'upside' potential of being able to minimize (ideally remove) the expensive "human component". Note, I am merely trying to portray a specific world model.

<< In the time spent toying around with agents, a lot of interesting applications could have been built, some of which may technically be agents, but without so much focus and effort on trying to solve for all use cases.

Preaching to the choir, man. We just got a custom AI tool (which manages to have all my industry-specific restrictions, rendering it kinda pointless; low context, making it annoying; and it's slower than normal, because everything now goes through several layers of approval, including 'bias').

At the same time, a committee bickers over a minute change to a process that has effectively no impact on anything of value.

Bonkers.

globular-toast•2h ago
I think in general if everyone is talking about a solution and nobody is talking about problems then it's a sign we're in a bubble.

For me the only problem I have is I find typing slow and laborious. I've always said if I could find a way to type less I would take it. That's why I've been using tab completion and refactoring tools etc for years now. So I'm kind of excited about being able to get my thoughts into the computer more quickly.

But having it think for me? That's not a problem I have. Reading and assimilating information? Again, not a problem I have. Too much of this is about trying to apply a solution where there is no problem.

georgeplusplus•1h ago
Maybe you are in a job where it's not a good use case, but there are fields that handle massive amounts of data, or spend a huge amount of time waiting on data processing before moving to the next step, where handing the work off to an AI agent, with a human then putting the pieces together based on their own logic and experience, would work quite nicely.
apwell23•1h ago
Not quite sure what you are proposing here. What exactly is the AI agent solving in this example?

I keep hearing vague stuff exactly like your comment from management at work. It's so infuriating.

georgeplusplus•7m ago
For instance, cybersecurity toolsets like MDE capture a lot of data. That data is meaningless unless someone is looking through it, and at my org there isn't enough manpower to do that, so one solution is using an agent to help characterize that network log data into what's suspicious or worth a human following up on.
wooque•2h ago
>I have no idea what their goal is

goal is to fire you (human), decrease costs and increase profits

infecto•2h ago
That’s a bit reductive and misses the core issue. Of course companies want to reduce headcount or boost productivity, but many are pursuing these initiatives without a clear problem in mind. If the mandate were, say, “we’re building X to reduce customer support staff by 20%,” that would be a different story. Instead, it often feels like solution-first thinking without a clear target.

Edit: not even going to reply to comments below as they continue down a singular path of oh you ought to know what they are trying to do. The only point I was making is orgs are going solution-first without a real problem they are trying to solve and I don’t think that is the right approach.

exe34•2h ago
> “we’re building X to reduce customer support staff by 20%,”

I've never understood the "do X to increase/decrease Y by Z%". I remember working at McDonald's, where the managers worked themselves up into a frenzy to increase "sales of McSlurry by 10%". All it meant was that they nagged people more and sold less of something else. It's not like people's stomachs got 10% larger.

coliveira•11m ago
The sad part is that companies doing this will very soon figure out that the 20% staff reduction they "achieved" came at the cost of a 100% increase in development costs and fees to the LLM vendor. Moreover, after a few years those fees will skyrocket, because their businesses are now dependent on this technology, and unlike people, LLMs are monopolized by just a few robber barons.
figassis•1h ago
That is not a goal that can be shared without alienating the current workforce. So you can bet that goal was clearly stated at the CXO level and is being communicated/translated piecewise as "let's find out how much more productive we can get with AI". You're going to find out about the goal once you reach it.

That is not to say you should work against your company, but bear in mind that this is the goal, and you should consider where you can add value outside of general code-factory productivity and how, for example, you can become a force multiplier for the company.

apwell23•1h ago
Yes, the head of my organization at my employer has asked us to submit "Generative AI Agent" proposals for the upcoming planning session. Apparently those ideas will get the big seat at the planning table. I've been trying to think of ideas, but they all end up being some sort of workflow automation that was possible without the agent stuff.

Agreed with your annoyance at "they are replacing you" comments. Like, duh. That's what they've been doing forever.

danieltanfh95•4h ago
Same. https://danieltan.weblog.lol/2025/06/agentic-ai-is-a-bubble-...

The fundamental difference is that we need HITL (human-in-the-loop) to reduce errors, instead of HOTL (human-out-of-the-loop), which leads to the errors you mentioned.

Xmd5a•3h ago
>A database query might return 10,000 rows, but the agent only needs to know "query succeeded, 10k results, here are the first 5." Designing these abstractions is an art.

It seems the author never used prompt/workflow optimization techniques.

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow https://arxiv.org/pdf/2501.16673

constantcrying•3h ago
No, it is not "mathematically impossible". It is empirically implausible. There is no statement in mathematics that says agents cannot have a 99.999% reliability rate.

Also, if you look at any human process you will realize that none of them have a 100% reliability rate. Yet, even without that we can manufacture e.g. a plane, something which takes millions of steps, each without a 100% success rate.

I actually think the article makes some good points, but especially when you are making good points it is unnecessary to stretch credibility with exaggerating your arguments.

macleginn•3h ago
This is a good point, but it seems, empirically, that most parts of a standard passenger airplane have reliability approximating 100% in a predefined time window with proper inspection and maintenance, otherwise passenger transit would be impossible. When the system does start to degrade, e.g. because replacement parts and maintenance becomes unavailable or too costly (cf. the use of imported planes by Russian airlines after the sanctions hit), incidents quickly start piling up.
constantcrying•3h ago
It's about what you do with errors. If you let them compound they lead to destruction, if instead you inspect, maintain, reinspect, replace, etc. you can manage them.

My point was that something extremely complex, like a plane, works, because the system tries hard to prevent compounding errors.

sarchertech•1h ago
That works because each plane is (nearly) exactly the same as the one before it and we have exact specifications for the plane.

You can do maintenance, inspections, and replacement because of those specifications.

In software the equivalent of blueprints is code. The room for variation outside software “specifications” is infinite.

Human reliability when it comes to assembling planes is also much higher than 99%, and LLM reliability when creating code is much, much lower than 99%.

john_minsk•2h ago
Valid point; however, the promise of AI is that it will be able to manufacture a metaphorical "plane" for each and every prompt a user inputs, i.e. give 100% overall reliability by using all the techniques (testing, decomposing, etc.) that intelligence can come up with.

So until these techniques are baked into the model by OpenAI, you have to come up with these ideas yourself.

deadbabe•3h ago
I just want someone to give me one legit use case where an AI Agent now enables them to do something that couldn’t be done before, and actually makes an impact on overall profit.
digitcatphd•3h ago
I’m sure most of the problems cited in this article will be easily solved within the next five years or so, waiting for perfection and doing nothing won’t pay dividends
snappr021•3h ago
The alternative is building Functional Intelligence process flows from the ground up on a foundation of established truth?

If 50% of training data is not factually accurate, this needs to be weeded out.

Some industries require a first principles approach, and there are optimal process flows that lead to accurate and predictable results. These need research and implementation by man and machine.

mritchie712•2h ago
> I've built 12+ production AI agent systems across development, DevOps, and data operations

It's hard to make *one* good product (see startup failure rates). You couldn't make 12 (as seemingly a solo dev?) and you're surprised?

we've been working on Definite[0] for 2 years with a small team and it only started getting really good in the past 6 months.

0 - data stack + AI agent: https://www.definite.app/

AstroBen•39m ago
They've built 12+ products with a full time job for the last 3 years

Something seems off about that...

RamblingCTO•2h ago
I also build agents/AI automation for a living. Coding agents, or anything open-ended, are just a stupid idea. It's best to have human-validated checkpoints, small search spaces, and very specific questions/prompts (does this email contain an invoice? YES/NO).

Just because we'd love to have fully intelligent, automatic agents doesn't mean the tech is here. I don't work on anything that generates content (text, images, code). It's just slop and will bite you in the ass in the long run anyhow.
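(For concreteness, a sketch of that kind of narrow prompt, with `call_llm` a stand-in: the answer space is two tokens, so validation is trivial and anything else gets escalated rather than trusted.)

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real completion call."""
    raise NotImplementedError

def contains_invoice(email_text: str) -> bool:
    prompt = (
        "Does this email contain an invoice? "
        "Answer with exactly YES or NO.\n\n" + email_text
    )
    answer = call_llm(prompt).strip().upper()
    if answer not in {"YES", "NO"}:
        # Out-of-space answer: route to a human instead of guessing.
        raise ValueError(f"unexpected answer: {answer!r}")
    return answer == "YES"
```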

la_fayette•2h ago
In general I would agree; however, the resulting systems of such an approach tend to be "just" expensive workflow systems, which could be built with old tech as well... Where is the real need for anything LLM here?
barbazoo•40m ago
Extracting structured data from unstructured text comes to mind. We've built workflows that we couldn't before, by bridging a non-deterministic gap. It's a business SaaS, but the folks using our software seem to be really happy with the result.
rco8786•2h ago
I still don't even know what an agent is. Everyone seems to have their own definition, and invariably it's generic vagueness about architecture, responsibilities of the LLM, sub-agents, comparisons to workflows, etc.

But still not once have I seen an actual agent in the wild doing concrete work.

A “No True Agent” problem if you will.

iamjackg•2h ago
Technically speaking, Claude Code is an agent, for example. It's just a fancy term for an LLM that can call tools in a loop until it thinks it's done with whatever it was tasked to do.

ChatGPT's Deep Research mode is also an agent: it will keep crawling the web and refining things until it feels it has enough material to write a good response.

neom•2h ago
"The real challenge isn't AI capabilities, it's designing tools and feedback systems that agents can actually use effectively." - this part I agree with - I'd been sitting the AI stuff out because I was unclear where I thought the dust would settle or what the market would accept, but recently joined a very small startup focused on building an agent.

I've gone from skeptical, to willing to humor it, to "yeah, this is probably right" in about 5 months. Basically, I believe: if you scope the subject matter very, very well, and then focus on the tooling the model will require to do its task, you get a high completion rate. There is a reluctance to lean into the non-deterministic nature of the models, but actually, if you provide really excellent tooling and scope super narrowly, it's generally acceptably good.

This blog post really makes the tooling part seem hard, and, well... it is, but not that hard - we'll see where this all goes, but I remain optimistic.

johndhi•1h ago
From what I understand customer support chatbots have had some pretty good outcomes from ai agents. Or does that not count?
nsypteras•44m ago
I think that would be one of the success cases described in the article because HITL is an integral part of good customer support chatbots. Support chats can be escalated to a human whenever the agent is unable to provide a satisfactory answer to the user.
jvanderbot•1h ago
My AI tool use has been a net positive experience at work. It can take over small tasks when I need a break, clean up or start momentum, and generally provide a good helping hand. But even if it could do my job, the costs pile up really quickly. Claude Code can easily burn $25 per 1-2 hours on a large codebase, and that's creeping along at a net-positive rate, assuming I can keep it on task and provide corrections. If you automate the corrections, we are up to $50/hr, or some tradeoff of speed, accuracy, and cost.

Same as it's always been.

For agents, that triangle is not very well quantified at the moment, which makes all these investigations interesting but still risky.

swader999•9m ago
Subscription?
atomon•1h ago
Is the main point “let me mathematically prove that it’s impossible to do what I’ve already done 12 times this year?”

Yes, very long workflows with no checks in between will have high error rates. This is true of human workflows too (which also have <100% accuracy at each step). Workflows rarely have this many steps in practice and you can add review points to combat the problem (as evidenced by the author building 12 of these things and not running into this problem)

tomhow•1h ago
[stub for offtopicness]
roschdal•5h ago
AI is for people without natural intelligence.
bboygravity•4h ago
So it's for 90+ percent of society?

Sounds like good business to me.

block_dagger•4h ago
Downvotes are for comments like yours
satyrun•2h ago
Yea just average IQ like Terence Tao.

All you are really saying with this comment is you have an incredibly narrow set of interests and absolutely no intellectual curiosity.

paradite•5h ago
This is obviously AI generated, if that matters.

And I have an AI workflow that generates much better posts than this.

Retr0id•4h ago
I think it's just written by someone who reads a lot of LLM output - lots of lists with bolded prefixes. Maybe there was some AI-assistance (or a lot), but I didn't get the impression that it was AI-generated as a whole.
paradite•4h ago
"Hard truth" and "reality check" in the same post is dead giveaway.

I read and generate hundreds of posts every month. I have to read books on writing to keep myself sane and not sound like an AI.

Retr0id•4h ago
True, the graphs are also wonky - the curves don't match the supposed math.
queenkjuul•2h ago
Yeah that was confusing to me
squigglydonut•4h ago
Absolutely! And you're right to think that. Here's why...
kookamamie•3h ago
Apologies! You're exactly right, here's how this spans out…
jrexilius•3h ago
The thing that sucks about it is that maybe his English is bad (not his native language), so he relies on LLM output for his posts. I'm inclined to cut people slack for this. But the rub is that it is indistinguishable from spam/slop generated for marketing/ads/whatever.

Or it's possible that he is one of those people that _really_ adopted LLMs into _all_ of their workflow, I guess, and he thinks the output is good enough as is, because it captured his general points?

LLMs have certainly damaged trust in general internet reading now, that's for sure.

paradite•2h ago
I am not pro or against AI-generated posts. I was just making an observation and testing my AI classifier.
fleebee•2h ago
The graphs don't line up. I'm inclined to believe they were hallucinated by an LLM and the author either didn't check them or didn't care.

Judging by the other comments this is clearly low-effort AI slop.

> LLMs have certainly damaged trust in general internet reading now, that's for sure.

I hate that this is what we have to deal with now.

delis-thumbs-7e•3h ago
I wonder why a person from Bombay, India might use AI to help with an English-language blog post…

Perhaps more interesting is whether their argument is valid and whether their math is correct.

rvz•4h ago
Let's get a timer to watch this fall off the front page of HN in minutes.

"We can't allow this post to create FUD about the current hype on AI agents and we need the scam to continue as long as possible".

vntok•4h ago
Generally speaking, low quality posts don't spend too much time on the front page, regardless of their topic.
saadatq•4h ago
We need a flag button for "written by AI".

I'm at the stage where I'm fine with AI-generated content. Sure, the verbosity sucks, but there's an interesting idea here. Just make it clear that you've used AI, and show your prompts.

rvz•2h ago
... and it's gone. Stopped the timer on 2 hours and 38 mins.
d4rkn0d3z•4h ago
"Let's do the math. "

This phrase is usually followed by some, you know...Math?

Gigachad•4h ago
The article is slop. That’s just a phrase ChatGPT uses a lot.
raincole•4h ago
> In a Nutshell

> AI tools aren't perfect yet. They sometimes make mistakes, and they can't always understand what you are trying to do. But they're getting better all the time, In the future, they will be more powerful and helpful. They'll be able to understand your code even better, and they'll be able to generate even more creative ideas.

From another post on the same site. [0]

Yup, slop.

[0]: https://utkarshkanwat.com/writing/review-of-coding-tools/

cmsefton•3h ago
2015? The title should be 2025.
RustyRussell•3h ago
2015? Title is correct, this is a typo
tomhow•3h ago
Sorry about that, my fault, moderating from my phone.
kerkeslager•2h ago
Real question: what's the best way to short AI right now?
arealaccount•1h ago
Just short any of the publicly traded companies with AI-based valuations? Nvidia, Meta? Seems like an awful idea, but I'm often wrong.
actinium226•41m ago
Very nice article. The point about mathematical reliability is interesting. I generally agree with it, but humans aren't 100% reliable, or even 99% reliable, so how do we manage to create things like the Linux kernel or the Mars landers without AI? Clearly we have some sort of goal-based self-correction mechanism. I wonder if there's research into AI on that thread?
an0malous•31m ago
> Clearly we have some sort of goal-based self-correction mechanism.

Humans can try things, learn, and iterate. LLMs still can't really do the second thing, you can feed back an error message into the prompt but the learning isn't being added to its weights so its knowledge doesn't compound with experience like it does for us.

I think there are still a few theoretical breakthroughs needed for LLMs to achieve AGI and one of them is "active learning" like this.

airstrike•24m ago
100% and it seems like we need a whole new architecture to get there, because right now training a model takes so much time.

At the risk of making a terrible analogy: right now we're able to "give birth" to these machines after months of training, but once they're born, they can't really learn. Whereas animals learn something new every day, go to sleep, clean up their memories a bit, deleting some and solidifying others, and wake up with an improved understanding of the world.

psadri•22m ago
You could instruct the LLM to formulate a “lesson” based on the error and add this to the tool instructions for future runs.
hannofcart•27m ago
> Let's do the math. If each step in an agent workflow has 95% reliability, which is optimistic for current LLMs, then:
> 5 steps = 77% success rate
> 10 steps = 59% success rate
> 20 steps = 36% success rate
> Production systems need 99.9%+ reliability.

(End quote)

Isn't this just wrong? Isn't the author conflating the accuracy of the LLM's output at each step with the accuracy of the final artifact, which is a reproducible, deterministic piece of code?

And they're completely missing that a person in the middle is going to intervene at some point to test it, at which point the output artifact's accuracy either goes to 100% or the person running the agent backtracks.

Either I'm missing something, or this does not seem well thought through.

hungryhobbit•21m ago
Did you even finish the article? The end is all about the trade-off of when "a person in the middle is going to intervene".

In fact, the point of the whole article isn't that AI doesn't work; to the contrary, it's that long chains of (20+) actions with no human intervention (which many agentic companies promise) don't work.

coliveira•19m ago
He's not wrong. The numbers are too pessimistic; however, when building software, the error rates don't need to be that high for a complete disaster to happen. Even if just 1% of the code is bad, it is still very difficult to make this work.

And you mention testing, which certainly can be done. But when you have a large product and the code generator is unreliable (which LLMs always are), then you have to spend most of your time testing.

alpha_squared•15m ago
One thing I'll add that isn't touched on here is context windows. While not "infinite", humans have a very large context window for problems they're specialized in solving. Models can often overcome their context-window limitations through larger and more diverse training sets, but that still isn't really a solution to context windows.

Yes, I get that the context window increases over time and that for many purposes it's already sufficient, but the current paradigm forces you to compress your personal context into a prompt to produce a meaningful result. In a language as malleable as English, this doesn't feel like engineering so much as incantations and guessing. We're losing so, so much by skipping determinism.

XMLUI

https://blog.jonudell.net/2025/07/18/introducing-xmlui/
73•mpweiher•1h ago•32 comments

AICodingHorrors – The price of AI-assisted coding

https://aicodinghorrors.com/
18•cratermoon•36m ago•4 comments

Coding with LLMs in the summer of 2025 – an update

https://antirez.com/news/154
116•antirez•4h ago•79 comments

The old Caveman Chemistry website (1996-2000)

https://cavemanchemistry.com/oldcave/
11•marcodiego•1h ago•0 comments

A Tour of Microsoft's Mac Lab (2006)

https://davidweiss.blogspot.com/2006/04/tour-of-microsofts-mac-lab.html
91•ingve•5h ago•12 comments

LLM architecture comparison

https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
179•mdp2021•8h ago•12 comments

Async I/O on Linux in databases

https://blog.canoozie.net/async-i-o-on-linux-and-durability/
132•jtregunna•9h ago•49 comments

Java was not underhyped in 1997 (2021)

https://dylanbeattie.net/2021/07/01/java-is-criminally-underhyped.html
29•SerCe•3d ago•14 comments

Group Behind Steam Censorship Policies Have Powerful Allies

https://web.archive.org/web/20250719204151/https://www.vice.com/en/article/group-behind-steam-censorship-policies-have-powerful-allies-and-targeted-popular-games-with-outlandish-claims/
29•davikr•34m ago•2 comments

A human metaphor for evaluating AI capability

https://mathstodon.xyz/@tao/114881418225852441
87•bertman•7h ago•7 comments

US signals intention to rethink job H-1B lottery

https://www.theregister.com/2025/07/20/h_1b_job_lottery/
23•rntn•44m ago•6 comments

How Tesla is proving doubters right on why its robotaxi service cannot scale

https://www.aol.com/elon-gambling-tesla-proving-doubters-090300237.html
82•Bluestein•2h ago•86 comments

Digital vassals? French Government 'exposes citizens' data to US'

https://brusselssignal.eu/2025/07/digital-vassals-french-government-exposes-citizens-data-to-us/
23•ColinWright•3h ago•5 comments

Behind the ballistics of the 'explosive' squirting cucumber

https://phys.org/news/2025-07-ballistics-explosive-squirting-cucumber.html
33•PaulHoule•2d ago•4 comments

Show HN: MCP server for Blender that builds 3D scenes via natural language

https://blender-mcp-psi.vercel.app/
93•prono•9h ago•33 comments

Show HN: ggc – A terminal-based Git CLI written in Go

https://github.com/bmf-san/ggc
36•bmf-san•4d ago•27 comments

Dual interfacial H-bonding-enhanced deep-blue hybrid copper–iodide LEDs

https://www.researchsquare.com/article/rs-4114691/v1
5•gnabgib•3d ago•1 comments

I'm betting against AI agents, despite building them

https://utkarshkanwat.com/writing/betting-against-agents/
227•Dachande663•6h ago•132 comments

How the 'Minecraft' Score Became Big Business for Its Composer

https://www.billboard.com/pro/how-minecraft-score-became-big-business-for-composer/
42•tunapizza•4d ago•13 comments

Hungary's oldest library is fighting to save books from a beetle infestation

https://www.npr.org/2025/07/14/nx-s1-5467062/hungary-library-books-beetles
157•smollett•3d ago•20 comments

Make Your Own Backup System – Part 1: Strategy Before Scripts

https://it-notes.dragas.net/2025/07/18/make-your-own-backup-system-part-1-strategy-before-scripts/
308•Bogdanp•19h ago•97 comments

The bewildering phenomenon of declining quality

https://english.elpais.com/culture/2025-07-20/the-bewildering-phenomenon-of-declining-quality.html
297•geox•7h ago•497 comments

Robot metabolism: Toward machines that can grow by consuming other machines

https://www.science.org/doi/10.1126/sciadv.adu6897
22•XzetaU8•7h ago•12 comments

Death by AI

https://davebarry.substack.com/p/death-by-ai
437•ano-ther•1d ago•176 comments

Nobody knows how to build with AI yet

https://worksonmymachine.substack.com/p/nobody-knows-how-to-build-with-ai
442•Stwerner•23h ago•346 comments

I tried vibe coding in BASIC and it didn't go well

https://www.goto10retro.com/p/vibe-coding-in-basic
141•ibobev•4d ago•147 comments

Beyond Meat fights for survival

https://foodinstitute.com/focus/beyond-meat-fights-for-survival/
138•airstrike•15h ago•334 comments

How to run an Arduino for years on a battery (2021)

https://makecademy.com/arduino-battery
83•thunderbong•3d ago•22 comments

Local LLMs versus offline Wikipedia

https://evanhahn.com/local-llms-versus-offline-wikipedia/
286•EvanHahn•22h ago•168 comments

Roman Roads Research Association (UK)

https://www.romanroads.org/index.html
26•countrymile•8h ago•5 comments