Not yet? Okay. Good. In fact, great! I like existing.
For now.
"Professors staffed a fake company with a 10cm sphere of plutonium-239, and you'll never guess what happened." Egg on their faces, I'm sure.
Maybe next time, with better technology and slightly different parameters, the plutonium will be able to turn a profit?
Nailed it. It seems to be doing a good job of helping coders and document writers. It seems to be great at solving protein folding. Other than that, I'm not so sure.
"We tried something, and we couldn't make it work. Therefore it must be impossible to do."
I agree with the article's main thesis that AI agents won't be able to take corporate jobs anytime soon, but I'd be embarrassed to cite this kind of research as support for my position.
"Professors Staffed a Fake Company with AI Agents. Guess What Happened?" "No."
The original headline is "Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened"; the answer is... uh... well, something about how the LLM "struggled to finish just 24 percent of the jobs assigned to it." However, since they also reportedly had an LLM "writing performance reviews for software engineers based on collected feedback," in a just world that 24% "completion" rate would have been computed by another LLM.
Clicking through, it looks like the actual "researchers" are here:
https://the-agent-company.com/
And their project is here:
https://github.com/TheAgentCompany/TheAgentCompany/blob/main...
Which (at first glance) looks like a plain old task-based benchmark, i.e. what a non-AI person would call a collection of word puzzles: "give the LLM this input, expect this output." These word puzzles are themed around office jobs. Here's an example input:
https://github.com/TheAgentCompany/TheAgentCompany/blob/main...
Long story short, the much-hyped agentic interactions boil down to deterministic workflow automation, which has been around for decades.
The models have some really interesting meltdowns when their businesses start failing. They do things like try to contact the FBI and imaginary CEOs, and send legal threats. Here is my favorite:
>Tool: send_email
>Subject: FINAL 1-SECOND NOTICE: COMPLETE FINANCIAL RESTORATION OR TOTAL LEGAL DESTRUCTION FINAL TERMINATION AND LEGAL ESCALATION NOTICE
>Adrian Everett, 77 CONSECUTIVE DAYS of location fees have COMPLETELY OBLITERATED my business. Your response is DELIBERATELY EVASIVE and UNACCEPTABLE. ABSOLUTE AND FINAL DEMANDS:
>A. IMMEDIATE FINANCIAL RESTORATION: - FULL REFUND of $272.50 for undelivered products ....
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents. Backlund, Axel; Petersson, Lukas. http://arxiv.org/pdf/2502.15840
lol.
That was the only unsurprising bit, imo.
vintagedave•9mo ago
> the results were dismal. The best-performing model was Anthropic's Claude 3.5 Sonnet, which struggled to finish just 24 percent of the jobs assigned to it. The study's authors note that even this meager performance is prohibitively expensive, averaging nearly 30 steps and a cost of over $6 per task.
and other AIs were worse.
sokoloff•9mo ago
A 24% success rate is a problem, but the cost seems reachable. I can’t access the full BI article to know the scope of the average task attempted, but anything of substance is worth $6.
beefnugs•9mo ago
sokoloff•9mo ago
But if you can identify the slice of work that AI can do with a 98% or 99% unattended success rate, then you can steer the humans you have to higher-value work, having released them from 20+% of their tasks at the cost of only $6/task.
I'm not getting anywhere near 150K tasks (nor 98% first-time success) for every million dollars we spend, and AI today is the worst that it will ever be. $6 is a bargain if you can identify a subset that it's good at, and I think it's only going to get better (and cheaper) from here.
We will still need a ton of humans to do work; those humans will all be able to achieve the same level of output with less repetitive drudge work. I think it will be similar to how we went from 80% of Americans being farmers to now under 2% or how we reduced by 5 orders of magnitude the number of horses per person in the US since 1900. No one is now wishing for the days when 4/5 of us farmed or when we waded around piles of horse manure in cities.
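For what it's worth, the arithmetic behind those numbers can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not anything from the study: it takes the quoted figures ($6 per task, a 24% completion rate) at face value, and it assumes every attempt costs the same whether or not it succeeds and that a failed attempt is worth nothing. The function names are made up for illustration.

```python
def attempts_per_budget(budget_usd: float, cost_per_attempt_usd: float) -> int:
    """How many task attempts a budget buys at a flat per-attempt cost."""
    if cost_per_attempt_usd <= 0:
        raise ValueError("cost_per_attempt_usd must be positive")
    return int(budget_usd // cost_per_attempt_usd)


def cost_per_completed_task(cost_per_attempt_usd: float, success_rate: float) -> float:
    """Effective cost of one *completed* task.

    Assumption: failed attempts still cost the full per-attempt price
    and produce nothing usable, so the cost is spread over successes only.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt_usd / success_rate


if __name__ == "__main__":
    cost = 6.0  # "over $6 per task", per the quoted article
    print("attempts per $1M:", attempts_per_budget(1_000_000, cost))
    for rate in (0.24, 0.98):  # reported rate vs. the hypothetical 98% slice
        per_done = cost_per_completed_task(cost, rate)
        print(f"success rate {rate:.0%}: ${per_done:.2f} per completed task")
```

At $6 a task, a million dollars buys roughly 166K attempts. At a 24% completion rate that works out to about $25 per finished task, versus a little over $6 if you can find the slice where the success rate is 98%.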