This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.
The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.
Question for people who are already doing this: How much are you spending on tokens?
That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.
I think people are burning money on tokens letting these things fumble about until they arrive at some working set of files.
I'm staying in the loop more than this, building up rather than tuning out.
I don't take your comment as dismissive, but I think a lot of people are dismissing interesting and possibly effective approaches with short reactions like this.
I'm interested in the approach described in this article because it's specifying where the humans are in all this, it's not about removing humans entirely. I can see a class of problems where any non-determinism is completely unacceptable. But I can also see a large number of problems where a small amount of non-determinism is quite acceptable.
At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems to be a breakdown of the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.
And it might be that tokens will become cheaper.
Future, better models will demand both more compute AND more energy. We should not underestimate how slowly energy production grows, or the supply constraints on simply hooking things up. Some labs are commissioning their own power plants on site, but that doesn't really get around the limits on power grid growth: you're using the same supply chain to build your own power plant.
If inference cost is not dramatically reduced, and models don't start meaningfully helping with innovations that make energy production faster and inference/training less power-hungry, the only way to control demand is to raise prices. Current inference pricing does not cover training costs. The labs can probably keep covering training on funding alone, but once the demand curve hits the power production limits, only one thing can slow demand, and that's raising the cost of use.
Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
Now, I've worked with many junior to mid-junior level SDEs, and sadly 80% do not do a better job than Claude. (I've also worked with staff-level SDEs who write worse code than AI, but they usually offset that with domain knowledge and TL responsibilities.)
I do see AI transforming software engineering into even more of a pyramid, with very few humans on top.
> At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans
You say
> Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
So you both are in agreement on that part at least.
Which sounds more like: if you haven't reached this point, you don't have enough experience yet, so keep going.
At least that's how I read the quote.
Edit: here's that section: https://simonwillison.net/2026/Feb/7/software-factory/#wait-...
I wonder what the security teams at companies that use StrongDM will think about this.
My hunch is that the thing that's going to matter is network effects and other forms of soft lockin. Features alone won't cut it - you need to build something where value accumulates to your user over time in a way that discourages them from leaving.
If I launch a new product, and 4 hours later competitors pop up, then there's not enough time for network effects or lockin.
I'm guessing what is really going to be needed is something that can't be just copied. Non-public data, business contracts, something outside of software.
You can see the first waves of this trend on HN's new page.
Heat death of the SaaSiverse
This is still the same problem -- just pushed back a layer. Since the generated API is wrong, the QA outcomes will be wrong, too. Also, QAing things is an effective way to ensure that they work _after_ they've been reviewed by an engineer. A QA tester is not going to test for a vulnerability like a SQL injection unless they're guided by engineering judgement which comes from an understanding of the properties of the code under test.
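To make that last point concrete, here's the kind of probe that only gets written when someone suspects how the query underneath is built (a rough sketch of my own, with a hypothetical `/search` endpoint and base URL):

```python
# Hypothetical injection probe: you only think to send a quote-breaking payload
# if you suspect the search endpoint builds its SQL by string concatenation.
import requests

BASE_URL = "https://staging.example.com"  # hypothetical test target

def test_search_is_not_injectable():
    # A benign query returns a filtered result set.
    benign = requests.get(f"{BASE_URL}/search", params={"q": "alice"}, timeout=10)
    assert benign.status_code == 200

    # A classic injection payload should not widen the results or surface a
    # database error; either outcome suggests unsanitized SQL underneath.
    probe = requests.get(f"{BASE_URL}/search", params={"q": "alice' OR '1'='1"}, timeout=10)
    assert probe.status_code in (200, 400)
    if probe.status_code == 200:
        assert len(probe.json().get("results", [])) <= len(benign.json().get("results", []))
```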
The output is also essentially the definition of a derivative work, so it's probably not legally defensible (not that that's ever been a concern with LLMs).
There are higher- and lower-leverage ways to do that (for instance, reviewing tests and QA'ing software through use versus reading the original code), but you can't get away from doing it entirely.
And “define the spec concretely” (and how to exploit emerging behaviors) becomes the new definition of what programming is.
(and unambiguously. and completely. For various depths of those)
This has always been the crux of programming. It has just been drowned in closer-to-the-machine, more-deterministic verbosities, be it assembly, C, Prolog, JS, Python, HTML, what-have-you.
There have been never-ending attempts to reduce that to more away-from-machine representations: low-code/no-code (anyone remember Last-one for the Apple ][?), DSLs of various levels of abstraction that are interpreted and/or generated from, on further to Esperanto-like artificial reduced-ambiguity languages... some even English-like.
For some domains, the above worked (and still works), and the (business) analysts became the new programmers. Some companies have such internal languages. For most others, not really. And not that long ago, the SW engineer job was called analyst-programmer.
But still, the frontier is there to cross.
> StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).
> You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And from my experience, it is very trustable in these aspects.
Can you detail a scenario by which an LLM can get the scenario wrong?
Then it seems like the only workable solution from your perspective is a solo member team working on a product they came up with. Because as soon as there's more than one person on something, they have to use "lossy natural language" to communicate it between themselves.
On the plus side, IMO nonverbal cues make it way easier to tell when a human doesn't understand things than an agent.
You can't 100% trust a human either.
But, as with self-driving, the LLM simply needs to be better. It does not need to be perfect.
You and I disagree on this specific point.
Edit: I find your comment a bit distasteful. If you can provide a scenario where it can get it wrong, that's a good discussion point. I don't see many places where LLMs can't verify as well as humans. If I developed new business logic like "users from country X should not be able to use this feature", an LLM can very easily verify it by generating its own sample API call and checking the response.
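For example, a scenario check for that country rule might look something like this minimal sketch (the endpoint, bearer token, and country header are hypothetical placeholders; it's the kind of test an agent could generate and run against the API itself):

```python
# Sketch of the "users from country X can't use this feature" scenario.
# The endpoint, bearer token, and country header are hypothetical.
import requests

BASE_URL = "https://api.example.com"  # hypothetical service under test

def call_feature(country_code: str) -> requests.Response:
    # Simulate a request originating from a given country.
    return requests.post(
        f"{BASE_URL}/v1/feature",
        json={"action": "use_feature"},
        headers={"Authorization": "Bearer test-token", "X-Client-Country": country_code},
        timeout=10,
    )

def test_feature_blocked_for_country_x():
    assert call_feature("XX").status_code == 403  # blocked country

def test_feature_allowed_elsewhere():
    assert call_feature("US").status_code == 200  # allowed country
```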
What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.
Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.
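A minimal sketch of that disagreement-surfacing step (my own illustration; the verdict format and unanimity rule here are placeholder choices, not the project's actual API):

```python
# Collapse verdicts from N independently prompted reviewer agents; any
# disagreement is surfaced and routed to a human instead of being averaged away.
from collections import Counter
from typing import List

def consensus_verdict(verdicts: List[str]) -> str:
    tally = Counter(verdicts)
    verdict, votes = tally.most_common(1)[0]
    if votes < len(verdicts):
        return f"escalate-to-human {dict(tally)}"  # disagreement is the signal
    return verdict  # unanimous: no human attention needed

# Usage: verdicts would come from N independent reviewer agents.
print(consensus_verdict(["approve", "approve", "approve"]))  # -> approve
print(consensus_verdict(["approve", "reject", "approve"]))   # -> escalate-to-human {...}
```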
As a previous StrongDM customer, I will never recommend their offering again. For a core security product, this is not the flex they think it is.
Also, mimicking other products' behavior and staying in sync is a fool's task. You certainly won't be able to do it just off the API documentation. You may get close, but never perfect, and you're going to experience constant breakage.
From what I've heard the acquisition was unrelated to their AI lab work, it was about the core business.
As an example: imagine someone writing a data pipeline for training a machine learning model. Anyone who's done this knows that such a task involves lots of data wrangling work like cleaning data, changing columns, and other ad hoc stuff.
The only way to verify that things work is if the eventual model that is trained performs well.
In this case, scenario testing doesn't scale because the feedback loop is extremely long - you have to wait until the model is trained and tested on held-out data.
Scenario testing clearly cannot work on the smaller parts of the work, like data wrangling.
In this model the spec/scenarios are the code. These are curated and managed by humans just like code.
They say "non interactive". But of course their work is interactive. AI agents take a few minutes-hours whereas you can see code change result in seconds. That doesn't mean AI agents aren't interactive.
I'm very AI-positive, and what they're doing is different, but they are basically just lying. It's a new word for a new instance of the same old type of thing. It's not a new type of thing.
The common anti-AI trope is "AI just looked at <human output> to do this." The common pro-AI trope from the StrongDM post is "look, the agent is working without human input." Both of these takes are fundamentally flawed.
AI will always depend on humans to produce relevant results for humans. It's not a flaw of AI, it's more of a flaw of humans. Consequently, "AI needs human input to produce results we want to see" should not detract from the intelligence of AI.
Why is this true? At a certain point you just run into Kolmogorov complexity: AI has fixed memory and a fixed prompt size, so by the pigeonhole principle not every output can be produced, for any input, given specific model weights.
Recursive self-improvement doesn't get around this problem. Where does it get the data for the next iteration? From interactions with humans.
Given the infinite complexity of mathematics (for instance, solving Busy Beaver numbers), this is a proof that AI cannot, in fact, solve every problem. Humans seem to be limited in this regard as well, but there is no proof that humans are fundamentally limited the way AI is. That lack of proof of human limitations is the precise advantage in intelligence that humans will always have over AI.
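One way to make that counting argument precise (my own framing, assuming a fixed vocabulary, bounded prompt length, and deterministic decoding):

```latex
% Fixed weights \theta, vocabulary \Sigma, prompts of length at most n:
% under deterministic decoding the model is a function on a finite domain,
% so its image is finite.
\[
  f_\theta : \Sigma^{\le n} \to \Sigma^{*}, \qquad
  \lvert \operatorname{im} f_\theta \rvert \;\le\; \lvert \Sigma^{\le n} \rvert
  \;=\; \sum_{k=0}^{n} \lvert \Sigma \rvert^{k} \;<\; \infty.
\]
% "Solving Busy Beaver" means emitting a correct, distinct value BB(k) for every k,
% an infinite (and uncomputable) family, which a finite image cannot cover.
```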
Code must not be written by humans
Code must not be reviewed by humans
I feel like I'm taking crazy pills. I would avoid this company like the plague.
Of course, you can't always tell the model what to do, especially if it is a repeated task. It turns out we already solved this decades ago using algorithms. Repeatable, reproducible, reliable. The challenge (and the reward) lies in separating the problem statement into algorithmic and agentic parts. Once you achieve this, the $1,000 token usage is not needed at all.
I have a working prototype of the above and I'm currently productizing it (shameless plug):
However - I need to emphasize, the language you use to apply the pattern above matters. I use Elixir specifically for this, and it works really, really well.
It works by starting with the architect: you. It feeds off specs and uses algorithms as much as possible to automate code generation (e.g. scaffolding), only using AI sparingly when needed.
Of course, the downside of this approach is that you can't just simply say "build me a social network". You can however say something like "Build me a social network where users can share photos, repost, like and comment on them".
Once you nail the models used in the MVC pattern and their relationships, the software design battle is pretty much 50% won. This is really good for v1 prototypes where you really want best practices enforced, OWASP-compliant code, and security-first output, which is where a pure agentic/AI approach would mess up.
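A rough Python sketch of that algorithmic/agentic split (my illustration only; the actual product is built in Elixir, and the spec format and `ask_llm` helper are hypothetical):

```python
# Deterministic scaffolding from a spec, with the LLM used only for the small
# pieces a template can't express. Spec format and helpers are hypothetical.
from typing import Dict, List

SPEC = {
    "models": {
        "User": ["email:string", "name:string"],
        "Photo": ["url:string", "caption:string", "user_id:references"],
    }
}

def render_scaffold(model: str, fields: List[str]) -> str:
    """Algorithmic part: pure templating, repeatable and reproducible."""
    lines = "\n".join(f"    {f.split(':')[0]}: ...  # {f.split(':')[1]}" for f in fields)
    return f"class {model}:\n{lines}\n"

def ask_llm(prompt: str) -> str:
    """Agentic part: used sparingly, only where the templates run out."""
    return f"# (hypothetical LLM call) {prompt}\n"

def build(spec: Dict) -> Dict[str, str]:
    files = {}
    for model, fields in spec["models"].items():
        scaffold = render_scaffold(model, fields)                # deterministic
        extras = ask_llm(f"Write validation rules for {model}")  # sparse AI use
        files[f"{model.lower()}.py"] = scaffold + extras
    return files

print(build(SPEC)["user.py"])
```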
> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!
This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants handling configuration and integrations for SaaS product suites, while not gone, is certainly under threat from LLMs that can ingest user requirements and produce functional code to do a similar thing at a fraction of the price.
What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.
This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?
For full-fledged software, there's genuine benefit to be had with human intervention and creativity; for the multitude of integrations and pipelines that were previously farmed out to pricey consultants, LLMs will more than suffice for all but the biggest or most complex situations.
Stuff comes in from one API and goes out to a different API.
With a semi-decent agent I can build what took me a week or two in hours just because it can iterate the solution faster than any human can type.
A new field in the API could’ve been a two day ordeal of patching it through umpteen layers of enterprise frameworks. Now I can just tell Claude to add it, it’ll do it up to the database in minutes - and update the tests at the same time.
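That kind of glue is usually a few dozen lines, something like this sketch (the source/destination URLs and field mapping are hypothetical placeholders):

```python
# The "in from one API, out to another" pattern: pull records, reshape the
# fields, push them on. Endpoints and field names are hypothetical.
import requests

SOURCE_URL = "https://source.example.com/api/records"
DEST_URL = "https://dest.example.com/api/items"

def transform(record: dict) -> dict:
    # The "new field" case: one more line here instead of patching it through
    # umpteen layers of enterprise frameworks.
    return {
        "external_id": record["id"],
        "title": record["name"],
        "owner_email": record.get("owner", {}).get("email"),
    }

def sync() -> None:
    records = requests.get(SOURCE_URL, timeout=30).json()
    for record in records:
        resp = requests.post(DEST_URL, json=transform(record), timeout=30)
        resp.raise_for_status()

if __name__ == "__main__":
    sync()
```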
So much of enterprise IT nowadays is spent hammering or needling vendors for basic API documentation so we can write a one-off that hooks DB1 into ServiceNow that's also pulling from NewRelic just to do ITAM. Consultants would salivate over such a basic integration because it'd be their yearly salary over a three month project.
Now we can do this ourselves with an LLM in a single sprint.
That's a Pandora's Box moment right there.
My content revenue comes from ads on my blog via https://www.ethicalads.io/ - rarely more than $1,000 in a given month - and sponsors on GitHub: https://github.com/sponsors/simonw - which is adding up to quite good money now. Those people get my sponsors-only monthly newsletter which looks like this: https://gist.github.com/simonw/13e595a236218afce002e9aeafd75... - it's effectively the edited highlights from my blog because a lot of people are too busy to read everything I put out there!
I try to keep my disclosures updated on the about page of my blog: https://simonwillison.net/about/#disclosures