
I made my VM think it has a CPU fan

https://wbenny.github.io/2025/06/29/i-made-my-vm-think-it-has-a-cpu-fan.html
186•todsacerdoti•3h ago•42 comments

Show HN: Octelium – FOSS Alternative to Teleport, Cloudflare, Tailscale, Ngrok

https://github.com/octelium/octelium
144•geoctl•5h ago•58 comments

We accidentally solved robotics by watching 1M hours of YouTube

https://ksagar.bearblog.dev/vjepa/
29•alexcos•49m ago•34 comments

Using the Internet without IPv4 connectivity

https://jamesmcm.github.io/blog/no-ipv4/
200•jmillikin•8h ago•82 comments

Bloom Filters by Example

https://llimllib.github.io/bloomfilter-tutorial/
88•ibobev•5h ago•12 comments

Beyond the Hook: A Technical Deep Dive into Modern Phishing Methodologies

https://blog.quarkslab.com/./technical-dive-into-modern-phishing.html
6•Metalnem•49m ago•1 comment

The Medley Interlisp Project: Reviving a Historical Software System [pdf]

https://interlisp.org/documentation/young-ccece2025.pdf
37•pamoroso•2h ago•2 comments

The Unsustainability of Moore's Law

https://bzolang.blog/p/the-unsustainability-of-moores-law
104•shadyboi•9h ago•67 comments

Scientists Retrace 30k-Year-Old Sea Voyage, in a Hollowed-Out Log

https://www.nytimes.com/2025/06/25/science/anthropology-ocean-migration-japan.html
18•benbreen•3d ago•2 comments

More on Apple's Trust-Eroding 'F1 the Movie' Wallet Ad

https://daringfireball.net/2025/06/more_on_apples_trust-eroding_f1_the_movie_wallet_ad
572•dotcoma•9h ago•349 comments

Performance Debugging with LLVM-mca: Simulating the CPU

https://johnnysswlab.com/performance-debugging-with-llvm-mca-simulating-the-cpu/
17•signa11•3h ago•5 comments

Implementing fast TCP fingerprinting with eBPF

https://halb.it/posts/ebpf-fingerprinting-1/
35•halb•5h ago•13 comments

MCP: An (Accidentally) Universal Plugin System

https://worksonmymachine.substack.com/p/mcp-an-accidentally-universal-plugin
701•Stwerner•1d ago•312 comments

Solving `Passport Application` with Haskell

https://jameshaydon.github.io/passport/
249•jameshh•17h ago•99 comments

Sequence and first differences together list all positive numbers exactly once

https://oeis.org/A005228
54•andersource•4d ago•21 comments

What LLMs Know About Their Users

https://www.schneier.com/
37•voxleone•3d ago•16 comments

The Death of the Middle-Class Musician

https://thewalrus.ca/the-death-of-the-middle-class-musician/
219•pseudolus•18h ago•451 comments

Why Go Rocks for Building a Lua Interpreter

https://www.zombiezen.com/blog/2025/06/why-go-rocks-for-building-lua-interpreter/
3•Bogdanp•3d ago•0 comments

Brad Woods Digital Garden

https://garden.bradwoods.io
9•samuel246•2d ago•2 comments

Schizophrenia is the price we pay for minds poised near the edge of a cliff

https://www.psychiatrymargins.com/p/schizophrenia-is-the-price-we-pay
158•Anon84•19h ago•224 comments

Improving River Simulation

https://undiscoveredworlds.blogspot.com/2025/04/improving-river-simulation.html
56•Hooke•3d ago•1 comment

We ran a Unix-like OS on our home-built CPU with a home-built C compiler (2020)

https://fuel.edby.coffee/posts/how-we-ported-xv6-os-to-a-home-built-cpu-with-a-home-built-c-compiler/
288•AlexeyBrin•1d ago•27 comments

Engineered Addictions

https://masonyarbrough.substack.com/p/engineered-addictions
615•echollama•1d ago•380 comments

Show HN: A different kind of AI Video generation

35•fcpguru•3d ago•9 comments

BusyBeaver(6) Is Quite Large

https://scottaaronson.blog/?p=8972
250•bdr•1d ago•176 comments

JavaScript Trademark Update

https://deno.com/blog/deno-v-oracle4
823•thebeardisred•22h ago•292 comments

Magnetic Tape Storage Technology: usage, history, and future outlook

https://dl.acm.org/doi/10.1145/3708997
32•matt_d•10h ago•7 comments

What UI first distinguished radio buttons from checkboxes with circles/squares?

https://retrocomputing.stackexchange.com/questions/31806/what-ui-first-distinguished-radio-buttons-from-checkboxes-with-circles-and-squar
56•azeemba•3d ago•35 comments

An Indoor Beehive in My Bedroom Wall

https://www.keepingbackyardbees.com/an-indoor-beehive-zbwz1810zsau/
152•gscott•1d ago•80 comments

Life of an inference request (vLLM V1): How LLMs are served efficiently at scale

https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1
157•samaysharma•22h ago•18 comments

AI agents get office tasks wrong around 70% of time, and many aren't AI at all

https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/
36•rntn•5h ago

Comments

santana16•4h ago
The question is whether this can change over the next three years or whether we've plateaued... It looks to me like progress is slowing down.
parpfish•4h ago
Prediction:

Once the industry has to acknowledge a plateau, there’ll be a pivot to “human in the loop” products. But instead of discussing their inability to do full automation, they’ll play up the ethics of preserving jobs for humans

Lerc•3h ago
Looking at just the Gemini results as an example, it's not surprising some are predicting decent results fairly soon.

    Gemini-1.5-Pro (3.4 percent)
    Gemini-2.0-Flash (11.4 percent)
    Gemini-2.5-Pro (30.3 percent)
I'm not sure if plateauing is the right interpretation of what is happening right now. The arrival of LLMs was like the doors opening on some massive Black Friday sale: you see news footage of people rushing in in a wave, scrambling to find the best goodies, and some crazed mother jumping up and down with a HotChix Viking Warrior edition like it's the holy grail. It is attention-grabbing, but not the real benefit of having doors that open.

I'll step out of the metaphor now before it gets stretched too far, but the point is transformers and LLMs were a real advancement. From reading papers, it seems like things are still advancing at a steady pace. The things that advance at a steady pace are just not the things that grab the headlines. Frequently, the things that do grab the headlines do not contribute to the overall advancement.

To use another metaphor. The crashing waves reach you because the tide is coming in, you notice the waves first, but the waves are not the tide. The wave receding does not mean the tide is going out.

baobun•4h ago
> "Many vendors are contributing to the hype by engaging in 'agent washing' – the rebranding of existing products, such as AI assistants, robotic process automation (RPA) and chatbots, without substantial agentic capabilities," the firm says. "Gartner estimates only about 130 of the thousands of agentic AI vendors are real."

I'd expect 100% of "agentic AI" to be hype. It's a meaningless term because almost any long-running software with an execution engine can qualify. What really differentiates "real agentic" from "slapped IFTTT and an LLM together"?

admjs•4h ago
Think of execution steps as nodes in a graph. IFTTT has pre-defined execution paths through the graph; it's deterministic. Agents design the execution path on the fly for the most contextually appropriate solution; it's non-deterministic. Both are state machines and DAGs.
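A toy Python sketch of that contrast. Everything here (call_llm, the stand-in tools) is hypothetical illustration, not a real agent API:

```python
# Stand-in "tools" shared by both styles.
def fetch(url): return f"contents of {url}"
def summarize(text): return text[:40]
def email(body): return f"sent: {body}"

TOOLS = {"fetch": fetch, "summarize": summarize, "email": email}

# IFTTT-style: the path through the graph is fixed at design time.
def pipeline(url):
    return email(summarize(fetch(url)))

def call_llm(history):
    # Stand-in planner. A real agent would ask an LLM to pick the next
    # node here, so the chosen path could differ from run to run.
    plan = ["fetch", "summarize", "email", "done"]
    return plan[len(history)]

# Agent-style: the model decides the next node at each step.
def agent(url):
    history, result = [], url
    while (step := call_llm(history)) != "done":
        result = TOOLS[step](result)
        history.append(step)
    return result
```

Both end up walking a DAG; the difference is who chose the edges, and when.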
cube00•3h ago
> Agents design the execution path on the fly for the most contextually appropriate solution

Sounds great but given the exact same context they don't design the same path each time.

baobun•3h ago
Sure, but the point is that AI-driven data pipelines with executive parts have been commonplace for well over a decade. What's a network of trading bots instructed by ML-models fed by market data and various signals if not agentic AI already? Price-settings systems for airline tickets? Spambots?

The word "agentic" doesn't bring anything new. We've always been doing this.

digitcatphd•3h ago
All these arguments remind me of early LLMs. In fact, most of the arguments are nearly identical. Heck, even Bitcoin was demonized 10 years ago, and now governments are pushing for national reserves.

If you have used GPT o3, you have used agentic models. If you use Claude Code or Cursor, you have used agentic models. If you have used Claude + MCP, you have used agentic models.

This isn't theoretical: it's already being applied. In fact, LangChain just hosted an entire conference to showcase how Fortune 100 companies were using LangGraph, and I can assure you that if they could have used a simpler architecture, they would have.

If you wait around until everyone considers it acceptable in the mainstream, all the good opportunities to create disruptive tech will be gone. So please, keep publishing this luddite trash and throwing shade from the sidelines. I will keep doing agent research and building with it. We will see in 10 years who was wrong, I guess.

https://www.qodo.ai/blog/building-agentic-flows-with-langgra...

octopoc•2h ago
I agree with the overall point, but I do think there's a better approach than LangChain. LangChain is a framework, not a library. I think a library approach is better at this point because libraries can be composed, which supports fundamental shifts in the structure of the code using the library. Frameworks can't be composed--that's the whole point of having a framework. Agentic patterns are so early that frameworks are limiting and we should be choosing libraries instead.

For example, there was a paper recently (unfortunately I can't remember which it was) that talked about how if you support pausing generation and allowing the human to edit the generated text, then resuming, you can get much better results than if you force the human to respond to the AI as a separate message. That's something that is more difficult for a framework to adapt to than for a library to adapt to.

I'm building an agentic framework that is more on the library side of things, and it necessarily leaves control of the iteration in the hands of the developer. The developer literally writes a while loop to run the agent, offloading things like RAG and tool calling to different pieces of the library. But every part of the iteration is controlled by the developer's agent source code.
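A minimal sketch of that library shape. Here llm, retrieve, and run_tool are hypothetical pieces the library would provide; the developer owns the loop itself:

```python
def retrieve(query):
    # RAG piece, offloaded to the library.
    return f"docs about {query}"

def run_tool(name, arg):
    # Tool-calling piece, also offloaded.
    tools = {"search": lambda q: f"results for {q}"}
    return tools[name](arg)

def llm(prompt):
    # Stand-in model call: asks for one tool, then answers.
    if "results for" in prompt:
        return {"type": "answer", "text": "done"}
    return {"type": "tool", "name": "search", "arg": "pricing"}

def my_agent(question):
    prompt = f"{retrieve(question)}\n{question}"
    # The developer literally writes the loop, so pausing generation,
    # editing the transcript, or reordering steps is ordinary control flow.
    for _ in range(10):  # hard iteration cap
        reply = llm(prompt)
        if reply["type"] == "answer":
            return reply["text"]
        prompt += "\n" + run_tool(reply["name"], reply["arg"])
    return "gave up"
```

Swapping in the pause-and-edit pattern from that paper would just mean inserting a step between the llm() call and the tool call, which is exactly what a framework's fixed control flow makes hard.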

digitcatphd•2h ago
Makes sense, yes, I think a spectrum of tools will be useful depending on the build. N8N, LangGraph, and working with primitives all have their use case and then tools built on each will be useful much like the landscape of RAG builds have evolved.
Macha•4h ago
In some ways, the big success of AI agents is getting people to invest in and/or pay for "it's sometimes right" compared to previous expectations that if a system is incorrect that's a bug that needs fixing yesterday.
cube00•3h ago
It's funny watching the grifters try to sell "but humans make mistakes too".

However, the reason I'm using a computer is because I want repeatable behaviour.

If there's a bug, I want it to fail every single time so I can fix it (yeah I know, buffer overruns mean it's not always repeatable), but at least I can squash the bug for good if I can actually find where I've stuffed up.

Side note: this is happening less now with Rust protecting me from myself, which pleases me; at least the failures are more repeatable if I can get the inputs right.

Whereas now the "prompt engineers" try to fluff up the prompt with more CAPS and "CRITICAL"-style wording to get it over the line today, knowing full well it'll just break in some other weird way tomorrow.

Hopefully a few more cases of LLMs going rogue will educate them, since they refuse to learn from Air Canada or Cursor Support.

mark_l_watson•4h ago
I think that Google Gemini has met my criterion for being an effective agent for Workspace data for quite a while. The problem is that after the novelty wears off, I don’t use it anymore.

I have written about 20 AI books in the last 30 years, and I am now finding myself to be a mild AI skeptic. My current AI use case is as a research assistant, and that is about all I use Gemini and ChatGPT for any longer. (Except for occasional code completion.) And think about all the energy we are using for LLM inference!

thunky•1h ago
> Google Gemini has met my criterion for being an effective agent for Workspace data for quite a while

> My current AI use case is as a research assistant

> I am now finding myself to be a mild AI skeptic

Seems contradictory.

darkxanthos•4h ago
I agree with the idea that true agentic AI is far from perfect and is overused in a lot of low- or negative-ROI contexts... but where the ROI is there, I'm not convinced it isn't still worthwhile, even if the error rate is high.

Augmented coding, as Kent Beck puts it, is filled with errors, but more and more people are starting to find it to be a 2x+ improvement for most cases.

People are spending too much time arguing that the extreme hype is extremely hyped and about what can't be done, and aren't looking at the massive progress in terms of what can be done.

Also no one I know uses any of the models in the article at this point. They called out a 50% improvement in models spaced 6 months apart... that's also where some of the hype comes from.

FlyingSnake•4h ago
> many aren’t AI at all

Reminds me of this classic: AI == Actually Indian

https://www.businesstoday.in/technology/news/story/700-india...

Edit: this is false and has been debunked. The real story is in the child comment.

ebiester•3h ago
According to multiple outlets including Gergely Orosz, this was incorrect reporting. https://blog.pragmaticengineer.com/builder-ai-did-not-fake-a...

This has impact on those careers so it’s worth getting right.

FlyingSnake•3h ago
Thanks for providing context. I’ve edited my comment and added a disclaimer to correct it.
upghost•3h ago
> When Captain Picard says in Star Trek: The Next Generation, "Tea, Earl Grey, hot," that's agentic AI, translating the voice command and passing the input for the food replicator. When astronaut Dave Bowman orders the HAL 9000 computer to, "Open the pod bay doors, HAL," that's agentic AI too.

More like GIR from Invader Zim.

ReptileMan•3h ago
For some reason all discussions of AI shortcomings remind me of this joke

A dog walks into a butcher shop with a purse strapped around his neck. He walks up to the meat case and calmly sits there until it's his turn to be helped. A man, who was already in the butcher shop, finished his purchase and noticed the dog. The butcher leaned over the counter and asked the dog what it wanted today. The dog put its paw on the glass case in front of the ground beef, and the butcher said, "How many pounds?"

The dog barked twice, so the butcher made a package of two pounds ground beef.

He then said, "Anything else?"

The dog pointed to the pork chops, and the butcher said, "How many?"

The dog barked four times, and the butcher made up a package of four pork chops.

The dog then walked around behind the counter, so the butcher could get at the purse. The butcher took out the appropriate amount of money and tied two packages of meat around the dog's neck. The man, who had been watching all of this, decided to follow the dog. It walked for several blocks and then walked up to a house and began to scratch at the door to be let in. As the owner opened the door, the man said to the owner, "That's a really smart dog you have there."

The owner said, "He's not that smart. This is the second time this week he forgot his key."

martinald•3h ago
I feel people are really underestimating the power of agents. There are a lot of drawbacks right now; I'd say the main ones are speed and context window size (and, relatedly, cost). Frontier LLMs are still slow as hell; it reminds me of dial-up internet. I think it's worth imagining a world where LLMs have 1000x the tok/s and (at least) 1000x the context/message length, because I don't think that is that far away, and developing for that.

For code, this would allow agents to build a web app, unit/e2e test it, take visual screenshots of the app and iterate on the design, etc. And do 50+ iterations of this at once. So you get 50 versions of the app in a few minutes with no input, with maybe another agent that ranks them and gives you the top 5 to play around with. Same for new features after you've built the initial version.

Right now they are so slow and have limited context windows that this isn't really feasible. But it would just require a few orders of magnitude improvements in context windows (at least) and speed (ideally, to make the cost more palatable).

I feel you can 'brute force' quality to a certain extent (even assuming no improvement in model quality) if you can keep a huge context window going (to avoid it going round in circles) and have multiple variations in parallel.
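The "many variations in parallel, then rank" idea can be sketched in Python. Here generate_app and the scoring are hypothetical stand-ins for an LLM build/test cycle and a judge agent:

```python
import concurrent.futures
import random

def generate_app(spec, seed):
    # Stand-in for one slow LLM build + test + screenshot cycle.
    # Returns (variant name, quality score from the judge).
    rng = random.Random(seed)
    return f"{spec}-v{seed}", rng.random()

def best_variants(spec, n=50, top_k=5):
    # Run all n builds concurrently; with real network-bound LLM calls
    # a thread pool is enough to overlap the waiting.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        builds = list(pool.map(lambda seed: generate_app(spec, seed), range(n)))
    # A ranking agent would order the candidates; here we sort by the
    # stand-in score.
    builds.sort(key=lambda b: b[1], reverse=True)
    return [name for name, _ in builds[:top_k]]
```

The wall-clock cost of this loop is roughly one build cycle, not fifty, which is why the speed and cost improvements matter more than model quality for making it feasible.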

cube00•3h ago
You can't brute-force your way out of a minefield of random hallucinations that change on every execution.
ptx•46m ago
> Gartner still expects that by 2028 about 15 percent of daily work decisions will be made autonomously by AI agents, up from 0 percent last year.

Companies hoping to automate decision-making had better keep in mind that article 22 of the GDPR [1] requires them, specifically in the case of automated decision-making, to "implement suitable measures to safeguard the data subject’s rights and freedoms and legitimate interests, at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision."

[1] https://gdpr-info.eu/art-22-gdpr/