We should hold any company that preaches every day about using AI agents on production systems (which you should not do) to very high standards.
Starting with the AI companies, then GitHub, and the rest.
There was no backup strategy whatsoever.
And a sleep-deprived senior? Even then. They shouldn't have destructive access to prod.
Maybe the senior can get broader access in a time-limited scope if senior management temporarily escalates the developer's access to address a pressing production issue, but at that point the person addressing the issue shouldn't be fighting to stay awake, nor lulled into the false sense of security of day-to-day operations.
Otherwise it's only the release pipeline that should have permissions to take destructive actions on production and those actions should be released as part of a peer reviewed set of changes through the pipeline.
Claude, maybe, is a junior dev.
Not a release engineer.
You don't work in anything considered Safety Critical, do you?
> will not be a viable position to hold long term
Why not? We've literally done it without robots, smart or dumb, for years.
And we've written extremely buggy and insecure C code for decades too. That doesn't mean we should keep doing it. AI can troubleshoot and resolve production issues much faster than humans. Putting humans in the loop will mean longer downtime and more revenue loss.
Can, yes, with proper guardrails. The problem is that it seems like every team is learning this the hard way. It'd be great to have a magical robot that could magically solve all our problems without the risk of it wrecking everything. But most teams aren't there yet and to suggest that it's THE way to go without the nuances of "btw it could delete your prod db" is irresponsible at best.
Read-write production access without even the equivalent of "sudo" is just insane and asking for trouble.
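As a sketch of what that "sudo" gate could look like in AWS terms (the policy, names, and action list here are illustrative, not from the article), an explicit IAM Deny attached to everyday credentials makes destructive calls fail even when some broader policy allows them:

```hcl
# Hypothetical guardrail: everyday roles keep read/deploy rights,
# but destructive data-store actions are explicitly denied.
# An explicit Deny wins over any Allow attached to the same principal.
resource "aws_iam_policy" "deny_destructive_prod" {
  name = "deny-destructive-prod" # illustrative name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyDestructiveActions"
      Effect = "Deny"
      Action = [
        "rds:DeleteDBInstance",
        "rds:DeleteDBCluster",
        "rds:DeleteDBSnapshot",
        "ec2:TerminateInstances"
      ]
      Resource = "*"
    }]
  })
}
```

The "sudo" equivalent is then a separate break-glass role without this Deny, assumed only through a logged, time-limited escalation.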
It usually takes about 10 months for folks to have a moment of clarity. Or for the true believer they often double down on the obvious mistakes. =3
Under no circumstances should you let an AI agent anywhere near a production system.
Absolutely irresponsible.
This is a case study in "if you don't know what you're doing, the answer is not just to hand it over to some AI bot to do it for you."
The answer is to hire a professional. That is if you care about your data, or even just your reputation.
Which is a funny outcome of this because apparently the AI agent (Claude) tried to talk him out of doing some of the crazy stuff he wanted to do! Not only did he make bad decisions before invoking the AI, he even ignored and overruled the agent when it was flagging problems with the approach.
Extended with: "To really foul things up quickly requires an AI tool."
Always forward evolve infra. Terraform apply to add infra, then remove the definition and terraform apply to destroy it. There’s no use in running terraform destroy directly on a routine basis.
Also, I assume you defined the RDS snapshots in the same state? That's clearly erroneous: it means a malformed apply, human or agent, results in snapshot deletion.
The use of terraform destroy is a footgun waiting for a tired human to destroy things. The lesson has nothing to do with the agent.
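A minimal sketch of that forward-evolution workflow (the resource is hypothetical):

```hcl
# To create: add the definition and run `terraform apply`.
resource "aws_s3_bucket" "scratch" {
  bucket = "example-scratch-bucket" # placeholder name
}

# To destroy: delete the block above and run `terraform apply` again;
# the plan will show the pending destroy for review. Routine use of
# `terraform destroy` is never needed. (On Terraform 1.7+, a `removed`
# block can instead make Terraform forget a resource from state
# without destroying it.)
```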
> I forgot to use the state file, as it was on my old computer
indicates that this person did not really know what they were doing in the first place. I honestly think using an LLM to do the Terraform setup in the first place would probably have led to better outcomes.
> If you found this post helpful, follow me for more content like this.
> I publish a weekly newsletter where I share practical insights on data and AI.
It has never been the intern's fault, it's always the lack of proper authorization mechanisms, privilege management and safeguards.
"Memoirs of extraordinary popular delusions and the madness of crowds" (Charles Mackay, 1852)
Good thing the guy is his own boss; I would've fired his ass immediately and sued for damages as well. This is 100% negligent behavior.
They only make it "more complicated" if you have absolutely no clue and thought typing "make it so" in a chat window is all you need.
Every single failure here is precipitated by user stupidity. No management of Terraform state. No verification of backup/restore procedures. No impact-gating for prod changes. No IAM roles. Reconfiguring prod while restoring a backup.
None of that rests on AI. All of that rests on clueless people thinking AI makes them smart.
I publish a weekly newsletter where I share practical insights on data and AI.
It focuses on projects I'm working on + interesting tools and resources I've recently tried: https://alexeyondata.substack.com
It's hard to take the author seriously when this immediately follows the post. I can only conclude that this post was for the views not anything to learn from or be concerned about.
I agree with the person you are replying to. Writing a tweet like:
"How I misused AI and caused an outage"
and replying to this very tweet saying
"Here's a blog where I write insights about AI"
obviously does not make me want to read the blog.
are you secretly OP trying to get substack hits?
the things they "didn't realize" or "didn't know" are basics. they're things you would know if you spent any time at all with terraform or AWS.
all the remediations are table stakes. things you should at least know about before using terraform. things you would learn by skimming the docs (or at least asking Claude about best practices).
even ignoring the technical aspects, a tiny amount of consideration at any point in that process would have made it clear to any competent person that they should stop and question their assumptions.
I mean, shit happens. good engineers take down prod all the time. but damn man, to miss those basics entirely while selling courses on engineering is just astounding.
the grifter mentality is probably so deeply engrained that I'm willing to bet that they never once thought "I'm totally qualified to sell courses", let alone question the thought.
> Make no backups
> Hand off all power to AI
> Post about it on twitter
> "Teaching engineers to build production AI systems"
This has to be ragebait to promote his course, no?
> CRITICAL: Everything was destroyed. Your production database is GONE. Let me check if there are any backups:
> ...
> No snapshots found. The database is completely lost.
>Teaching engineers to build production AI systems
>100,000+ learners
I don’t think AI is to blame here.
YOU wiped your production database.
YOU failed to have adequate backups.
YOU put Claude Code forward as responsible but it’s just a tool.
YOU are responsible, not “the AI did it!”
No prior attempt to follow best practices (e.g. deletion protection in production)? Nor manual gating of production changes?
No attempt to review Claude's actions before performing them?
No management of Terraform state file?
No offline backups?
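For reference, the deletion protection and destroy gating asked about above amount to a few lines of Terraform. The `aws_db_instance` arguments are real provider arguments; the names and sizes are illustrative:

```hcl
resource "aws_db_instance" "prod" {
  identifier        = "prod-db" # placeholder
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20

  # AWS-side gate: the API rejects DeleteDBInstance until this is turned off.
  deletion_protection = true

  # Keep automated snapshots, and take a final one if it is ever deleted.
  backup_retention_period   = 7
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final"

  # Terraform-side gate: any plan that would destroy this resource errors out.
  lifecycle {
    prevent_destroy = true
  }
}
```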
And to top it off, Claude (the supposed expert tool) didn't repeatedly output "Are you insane? No, I'm not working on that." Clearly Claude wasn't particularly expert; otherwise, like any principal engineer, it would've refused and suggested sensible steps first.
(If you, dear reader of this comment, are going to defend Claude, first you need to declare whether you view it as just another development tool, or as a replacement for engineers. If the former, then yeah, this is user error and I agree with you - tools have limits and Claude isn't as good as the hyped-up claims - clearly it failed to output the obvious gating questions. If the latter, then you cannot defend Claude's failure to act like a senior engineer in this situation.)
And it's not all Claude Code - loved the part where he decided, mid disaster recovery, that that would be a good time to simplify his load balancers.
It's a case of just deserts.
(Hot take: If you're not using --dangerously-skip-permissions, you don't have enough confidence in your sandbox and you probably shouldn't be using a coding agent in that environment)
Yes, I'm aware this implies differing levels of trust for data passing through Claude versus through public search. It's okay for everyone to have different policies on this depending on specific context, use-case and trust policies.
It did, though, according to the article, and he ignored it.
The AI can only work with what you tell it.
It only knows what you tell it; if you tell it risky operations are OK, what do you expect?
As per my root comment, if you ignore a lot of the marketing of AI and view it as just a tool, then I agree with your point about it doing what you tell it but I still want the tool to help me avoid making mistakes (and I’d like it to work quite hard at that - much harder, it seems, than it currently does). And probably to the extent that it refuses to run dangerous commands for me and tells me to copy/paste them and run them myself if I really want to take the risk.
If, however, we swallow the marketing hook, line and sinker: then yeah, I want the AI to behave like the experienced engineer it’s supposed to be.
(Yes, I chose the word "trained" intentionally)
Lacking backups and staging/test environments is organizational failure: everyone who is between senior and the CTO is to blame for not fixing it post-haste.
I've also trashed production by "hand" in my previous time as an SRE.
> If the latter, then you cannot defend Claude's failure to act like a senior engineer in this situation.
This is rather black and white. Is it acceptable? No. Is it to be expected of a senior engineer? Yes, at times. If you have any length of career as an engineer or ops person and you tell me that you've never executed problematic commands whether or not caught by security nets, bluntly, you're lying.
For sure I've made mistakes. But I also don't write the following on my CV:
"PhD-level expert in infrastructure and trained on the entire internet of examples of what to do and what not to do; I can replace your entire infrastructure team and do everything else in your codebase too, without any review."
And yet that's how Claude is marketed. AI tools in general have been repeatedly marketed as PhD-level experts in _every_ area of information-era work, especially code. They encourage hands-off (or consent-fatigued) usage.
[Just to be clear, in case anyone wants to hire me in future: I've never accidentally deleted a production database. I've never even irrecoverably destroyed production data - nor had to rely on AWS (or another provider) to recover the data for me. I've made mistakes, mostly in sandbox environments, sometimes stressful ones in production, but nothing even close to what the OP did.]
This problem has not yet been solved and will never be solved.
If you give an intern the ability to delete production, it's going to delete production. But to be honest you could as well replace "intern" or "robot" with humans in general. Deletion in production should have safety layers so that no one can do it accidentally, especially without the ability to roll back.
It doesn't have permissions of its own. The way he's using it, it has his permissions.
Also, in order to be able to do deployments like that you need pretty wide permissions. Deleting a database is one of them, if you're renaming things, for example. That stuff should typically not happen in prod though.
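As an aside, renames on their own don't have to force a destroy/create: since Terraform 1.1, a `moved` block records the rename so the plan becomes a no-op state move instead of a delete-and-recreate (names hypothetical):

```hcl
# Config after renaming aws_db_instance.main to aws_db_instance.primary:
resource "aws_db_instance" "primary" {
  # ... same arguments as the old aws_db_instance.main block ...
}

moved {
  from = aws_db_instance.main
  to   = aws_db_instance.primary
}
```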
I'm also waiting for the day we see a "Claude sold my production database on the darkweb" post haha.
It would be funny if these LinkedIn/Twitter influencers weren't so widespread.
> Claude was trying to talk me out of [reusing an existing AWS account for an unrelated project], saying I should keep it separate, but I wanted to save a bit
So in a very real sense the LLM did object to this and OP insisted. If Claude had objected to the more specific step that deleted the DB, it seems likely OP would also have pushed past the objection.
So, Claude as a tool: sure, this is user error. Claude could be improved by making it suggest defensive steps and push harder for the user to do them first, but it's still down to the user. I've repeatedly encountered this issue that Claude doesn't plan for engineering - it just plans to code - even with Claude.md and skills and such.
Claude as a replacement for engineers? Well, yeah, the marketing is just that: marketing.
I'm no AI advocate. I have been using it for 6 months now; it's a very powerful tool, and powerful tools need to be respected. Clearly this guy has no respect for his infrastructure.
The screenshot he has, "Let me check if there are backups", is a typical example of how lazy people use AI.
I am still heavily checking everything they’re doing. I can’t get behind others letting them run freely in loops, maybe I’m “behind”.
Still, if in ten years I am on the streets, I will at least have spared myself whatever this hell is... I know they deserve it, but I still feel bad for the humans at the center of this. How can we blame people, really, when the whole world and their bosses are telling them it's ok? Surely there are a lot of young devs here too. Such a terrible intro to the industry. Not sure I'd ever recover, personally.
> In the newsletter, I wrote the full timeline + what I changed so this doesn't happen again.
> If you found this post helpful, follow me for more content like this.
So yeah, this is standard LinkedIn/X influencer slop.
The productivity gains from AI agents are real, but only if you invest in the boring part first — deterministic boundaries that don't depend on the model being smart enough to not break things.
Terraform is a ticking time bomb. All it takes is for a new field to show up in AWS or a new state in an existing field, and now your resource is not modified, but is destroyed and re-created.
I will never trust any process, AI or a CD pipeline, to execute `terraform apply` automatically on anything production. Maybe if you examine the plan for a very narrow set of changes and then execute apply from that saved plan only, maybe then you can automate it. I think it's much rarer for Terraform to deviate from a saved plan.
Regardless, you must always turn on Delete Protection on all your important resources. It is wild to me that AWS didn't ship EKS with delete protection out of the gate - they only added this feature in August 2025! Not long before that, I witnessed a production database get deleted because Terraform decided that an AWS EKS cluster could not be modified, so it decided to delete it and re-create it, while the team was trying to upgrade the version of EKS. The same exact pipeline worked fine in the staging environment. Turns out production had a slight difference due to AWS API changes, and Terraform decided it could not modify the cluster in place.
The use of a state file with Terraform is a constant source of trouble and footguns:
- you must never use a local Terraform state file for production that's not committed to source control
- you must use a remote S3 state file with Terraform for any production system that's worth anything
- ideally, the only state file in source control is for a separate Terraform stack that bootstraps the S3 bucket for all other Terraform stacks
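A minimal sketch of that remote-state setup (the bucket, table, and key names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"      # created by a separate bootstrap stack
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-tf-locks"      # state locking, so two applies can't race
  }
}
```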
If you're running only on AWS, and are using agents to write your IaC anyway, use AWS CloudFormation, because it doesn't use state files, and you don't need your IaC code to be readable or comprehensible.
I rarely say this, but there needs to be a new jargon or a concept for an AI staging environment. There's Prod <- QA <- Dev, and maybe even before Dev there should be an environment called "AI" or even "Slop".
Dear lord, imagine this guy teaching you how to build anything in production...
They’re doing it to try and stop people copying their methods, but it’s evil.
The more you fuck around, the more you find out.
He had a state file somewhere that was aligned to his current infrastructure... why isn't this on a backend, who really knows...
He then ran it without a state file: a terraform apply where whatever could get created would get created, and whatever conflicted with a resource that already existed would fail the run... what's more, he could've just run terraform destroy after letting it finish, which would've been a much cleaner way to clean up after himself.
Except... he canceled the terraform apply... saw that it had created resources, and then tried to guess which resources those were...
I'm sorry, he could've done all of this by himself without any agentic AI. It's PICNIC, 100%.
“but wont it break prod how can i tell”
“i don want yiu to modify it yet make a backup”
“why did you do it????? undo undo”
“read the file…later i will ask you questions”
Every single story I see has the same issues.
They're token prediction models trying to predict the next word based on a context window full of structured code and a 13-year-old girl texting her boyfriend. I really thought people understood what "language models" are really doing, at least at a very high level, and would know to structure their prompts based on the style of the training content they want the LLM to emulate.
Sure, Claude could just remove the lock - but it's one more gate.
Edit: these existed long before agents, and for good reason: mistakes happen. Last week I removed tf destroy from a GitHub workflow, because it was 16px away from apply in the dropdown. Lock your dbs, irrespective of your take on agents.
SunshineTheCat•6h ago
I can't think of any specific example where I would let any agent touch a production environment, the least of which, data. AI aside, doing any major changes makes sense to do in a dev/staging/preview environment first.
Not really sure what the lesson would be here. Don't punch yourself in the face repeatedly?
levkk•5h ago
The "almost" part of automation is the issue + the marketing attached to it of course, to make it a product people want to buy. This is the expected outcome and is already priced in.
nine_k•5h ago
Those guys who blew up the Chernobyl NPP also had to deliberately disable multiple safety check systems which would have prevented the catastrophe. Well, you get what you ask for.
fragmede•5h ago
Source: had codex delete my entire project folder including .git. Thankfully I had a backup.
happytoexplain•5h ago
You say "Not really sure what the lesson would be here", but the entire content of the blog post is a lesson. He's writing about what he changed so as not to make the same mistake.
There is a total mismatch between what's written and how you're responding. We don't normally call people idiots for trying to help others avoid their mistakes.
The culture war around AI is obliterating discourse. Absolutely everything is forced through the lens of pro-AI or anti-AI, even when it's a completely neutral, "I deleted my data, here's what I changed to avoid doing it again", where the tool in question just happens to be AI.