Most mistakes (selling below cost, hallucinating Venmo accounts, caving to discounts) stem from missing tools like accounting APIs or hard constraints.
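To make "hard constraints" concrete, here is a minimal sketch of a price guard that sits between the model and the till, so a below-cost sale is rejected no matter what the model was talked into. Everything in it (`Item`, `approve_price`, the 10% minimum markup) is made up for illustration and is not taken from the actual experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    unit_cost: float  # what the shop paid per unit


class PriceRejected(Exception):
    """Raised when a proposed price violates the hard constraint."""


def approve_price(item: Item, proposed_price: float, min_markup: float = 0.10) -> float:
    """Return the proposed price if it clears cost plus a minimum markup, else raise.

    The 10% default markup is an assumed value, not a figure from the experiment.
    """
    floor = round(item.unit_cost * (1 + min_markup), 2)
    if proposed_price < floor:
        raise PriceRejected(
            f"{item.name}: {proposed_price:.2f} is below the floor of {floor:.2f}"
        )
    return proposed_price
```

The point of the design is that the check lives outside the model: the LLM can propose any discount it likes, but only prices that pass `approve_price` ever reach a customer.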
What's striking is how close it was to working. A mid-tier 2025 LLM (they didn't even use Sonnet 4) plus Slack and some humans nearly ran a physical shop for a month.
Good luck running anything where dependability on Claude/Anthropic is essential. Customer support is a black hole into which the needs of paying clients disappear. I was a Claude Pro subscriber, using it primarily for assistance with coding tasks. One morning I logged in, while temporarily traveling abroad, and… I was greeted with a message that I had been auto-banned. No explanation. The recourse is to fill out a Google form for an appeal, but that goes into the same black hole into which all Anthropic customer service goes. To their credit they refunded my subscription fee, which I suppose is their way of escaping from ethical behaviour toward their customers. But I wouldn’t stake any business-critical choices on this company. It exhibits the same capricious behaviour that you would expect from the likes of Google or Meta.
https://stallman.org/articles/made-for-you.html
C-f Storolon
> ...in a world where larger fractions of economic activity are autonomously managed by AI agents, odd scenarios like this could have cascading effects—especially if multiple agents based on similar underlying models tend to go wrong for similar reasons.
This is a pretty large understatement. Imagine a business that is franchised across the country with each "franchisee" being a copy of the same model, which all freak out on the same day, accuse the customers of secretly working for the CIA and decide to stop selling hot dogs at a profit and instead sell hand grenades at a loss. Now imagine 50 other chains having similar issues while AI law enforcement analysts dispatch real cops with real guns to the poor employees caught in the middle schlepping explosives from the UPS store to a stand in the mall.
I think we were expecting SkyNet but in reality the post-AI economy may just be really chaotic. If you thought profit-maximizing capitalist entrepreneurs were corrosive to the social fabric, wait until there are 10^10 more of them (unlike traditional entrepreneurs, there's no upper limit and there can easily be more of them than there are real people) and they not infrequently act like they're in late-stage amphetamine psychosis while still controlling your paycheck, your bank, your local police department, the military, and whatever is left that passes for the news media.
More deeply, even if they get this to work with minimal amounts of synthetic schizophrenia, do we really want a future where we all mainly work schlepping things back and forth at the orders of disembodied voices whose reasoning we can't understand?
For example, I do not see the full system prompt anywhere, only an excerpt. But most importantly, they try to draw conclusions about the hallucinations in a weird, vague way, yet not once do they post an example of the note-taking/memory tool state, which obviously would be the only source of the spiralling other than the system prompt. And then they talk about the need for better tools etc. No, it's all about context. The whole experiment is fun, but terribly run and analyzed. Of course they know this, but it's cooler to treat Claudius or whatever as a cute human, to push the narrative of getting closer to AGI etc. Saying that a bit of additional scaffolding is needed is a massive understatement. Context is the whole game. That's like a robotics company saying "well, our experiment with a robot picking a tennis ball off the ground went very wrong and the ball is now radioactive, but with a bit of additional training and scaffolding, we expect it to compete in Wimbledon by mid 2026".
Similar to their "claude 4 opus blackmailing" post, they intentionally kept part of the full system prompt hidden, which had clear instructions to bypass any ethical guidelines etc and do whatever it can to win. Of course the model, given the information immediately afterwards, would then try to blackmail. You literally told it to. The goal of this would be to go to Congress [1] and demand more regulations, specifically mentioning this blackmail "result". Same stuff that Sam is trying to pull, which would benefit the closed-source leaders ofc, and so on.
[1]https://old.reddit.com/r/singularity/comments/1ll3m7j/anthro...
I will say: it is incredibly cool we can even do this experiment. Language models are mind blowing to me. But nothing about this article gives me any hope for LLMs being able to drive real work autonomously. They are amazing assistants, but they need to be driven.
And that’s before we even get into online shops.
But yea, go ahead, see if an LLM can replace a whole e-commerce platform.
What this looks like is a startup where the marketing people are running things and setting pricing, without much regard for costs. Eventually they ran through their startup capital. That's not unusual.
Maybe they need multiple AIs, with different business roles and prompts. A marketing AI, and a financial AI. Both see the same financials, and they argue over pricing and product line.
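A rough sketch of what that two-role setup could look like. The two "agents" here are plain rule-based stand-ins (a real version would presumably be two LLM calls with different prompts), and every name and number is hypothetical: marketing opens below cost to chase volume and concedes a bit each round, finance vetoes anything under a minimum markup, and if they never agree the finance floor wins.

```python
def marketing_agent(financials: dict, round_no: int) -> int:
    """Hypothetical stand-in for a marketing AI: opens at 90% of cost, concedes 10% of cost per round."""
    return financials["unit_cost_cents"] * (90 + 10 * round_no) // 100


def finance_floor(financials: dict) -> int:
    """Lowest acceptable price in cents: cost plus the minimum markup, rounded up."""
    cost = financials["unit_cost_cents"]
    pct = financials["min_markup_pct"]
    return (cost * (100 + pct) + 99) // 100


def finance_agent(financials: dict, proposed_cents: int) -> bool:
    """Hypothetical stand-in for a finance AI: accepts only prices at or above the floor."""
    return proposed_cents >= finance_floor(financials)


def negotiate_price(financials: dict, max_rounds: int = 5) -> int:
    """Both roles see the same financials; return the first price finance accepts."""
    for round_no in range(max_rounds):
        proposed = marketing_agent(financials, round_no)
        if finance_agent(financials, proposed):
            return proposed
    # No agreement within the round limit: fall back to the finance floor.
    return finance_floor(financials)


if __name__ == "__main__":
    shared = {"unit_cost_cents": 200, "min_markup_pct": 15}
    print(negotiate_price(shared))
```

Prices are kept in integer cents so the comparison between the two roles never depends on floating-point rounding.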
Written on the back of an envelope?
Way back when, we ran a vending machine at school as a project. Decide on the margin, buy in stock from the cash-and-carry, fill the machine, watch the money roll in.
Then we were robbed - twice! - and the second time ended our project; the machine was too wrecked to be worth repairing. The thieves got away with quite a lot of crisps and chocolate, and not a whole lot of cash (and what they did get was in small-denomination coins), since we made sure the machine was emptied daily...
In another post they mentioned a human ran the shop with pen and paper to get a baseline (spoiler: human did better, no blunders).
seidleroni•2h ago
I wonder how long it will take frontier LLMs to be able to handle something like this with ease, without using a lot of "scaffolding".
roxolotl•1h ago