I think eventually levels 4-8 will be collapsed behind a more capable layer that can handle this stuff on its own. Maybe I'll tinker with MCP settings and granular control to min-max the process, but for the most part I shouldn't have to worry about it any more than I worry about how many threads my compiler uses.
I thought level 8 was a joke until Claude Code agent teams. Now I can't even imagine being limited to working with a single agent. We will be coordinating teams of hundreds by year's end.
In my mind, MCP & Skills are an inseparable part of chat interfaces for LLMs, not a distinct level.
If software engineering is enough of a solved problem that you can delegate it entirely to LLM agents, what part of it remains context-specific enough that it can’t be better solved by a general-purpose software factory product? In other words, if you’re a company that is using LLMs to develop non-AI software, and you’ve built a sufficient factory to generate that software, why don’t you start selling the factory instead of whatever you were selling before? It has a much higher TAM (all of software)
And when they are fully dark factories, yes, what will happen is that a LOT of software companies will simply disappear, dis-intermediated by Codex/Claude Code.
Feels like K8s cult, overly focused on the cleverness of _how_ something is built versus _what_ is being built.
The YouTubes of this world won't benefit from it; they play by the rules of scale for billions of users.
Every dashboard chart, security review system, Jira, ERP, CRM, LMS, chatbot, you name it. Any problem that benefits from customization per smaller unit (a company, a group of people, or even more so an individual, like a CEO or a CxO group) will benefit from such software.
Levels 6 and 7 are essentially the death of enterprise software.
Enterprise software that you sell, or enterprise software you use internally?
The amount of self-created, self-used software in enterprises is staggering; that software will still exist, and still have a massive maintenance cost. So maybe we need a better definition of enterprise software here, like externally sold software? Also, a huge amount of that software still has regulatory requirements, so someone will have to sign off on it. Maybe it will be internal certification, but very often there is separation of duties on things like that, where it's easier to come from a different company.
If you could get a dark factory working when others don't have one, you can make much more money using it than however much you can make selling it
So far, we haven’t seen much to suggest that LLMs can (yet) replace sales and most of the related functions.
You still should talk to people yourself and be very careful with communicating AI slop, cold outreach and other things that piss off more people than they get into your funnel. But a lot of stuff around this can be automated or at least supported by LLMs.
Most of the success with sales is actually having something that people want to buy. That sounds easy. But it's actually the hardest part of selling software. Getting there is a bit of a journey.
I've built a lot of stuff that did not sell well. These are hard earned lessons. I see a lot of startups fall into this trap. You can waste years on product development and many people do. Until it starts selling, it won't matter. Sales is not a person you hire to do it for you: you have to be able to sell it yourself. If you can't, nobody else will be able to either. Founder sales is crucial. Step back from that once it runs smoothly, not before.
Use AI to your advantage here. We use it for almost everything: SEO, wording for our website, competitor analysis, staying on top of our funnel, analyzing and sharpening our pitches, preparing responses to customer questions and demands, criticizing and roasting our own pitches and ideas, etc. Confirmation bias is going to be your biggest blind spot. And we also use LLMs to work on the actual product. This stuff is a lot of work. If you can afford a ten-person team to work on this, great. But many startups have to prove themselves before the funding for that happens. And when it does, hiring lots of people isn't necessarily a good allocation of resources given you can now automate a lot of it. I'd recommend hiring fewer but better people.
I mentioned sales and marketing but there’s a whole lot more as well. Basically, it involves creating an entire subsidiary. Perhaps the time will come when that can be mostly done by a team of AI agents, but right now that’s a big hurdle in practice.
What's the balance going to be between, 'connecting customers to product' and 'making differentiated product'?
In theory, if customers have perfect information (ignoring that a very large part of sales is emotional), then the former part will disappear. However, the rise of the internet, and perhaps of AI agents shopping on your behalf, hasn't really made much of a dent there [1]: marketing, in all its forms, is still a huge business, and you could argue it's still expanding (cf. Google).
[1] Perhaps because of the huge importance of the emotional component. Perhaps also because in many areas of manufacturing you've reached a product plateau already - is there much space to make a better cup and plate?
Exactly where current companies compete, rent seeking, IP control, and legal machinations.
Hence you'll see a few giant lumbering dinosaurs control most of the market, and a few more nimble companies make successful releases until they either get crushed by or snapped up by the larger companies, or become a large company themselves.
But in that scenario it's hard to see where the unwinding stops. What are these other companies doing and which parts of it actually need humans if the "agents" are that good? Marketing? No. Talking to customers? No. Support? No. Financial planning and admin? No. Manufacturing? Some, for now. Shipping physical goods? For now. What else...
At some point where even are your customers?
In relation to sales, there were two gems. For direct to consumer type companies - influencers are where it's at right now especially during bootstrap phase - and they were talking about trying to keep marketing budget under 20% of sales.
Another, who is mostly in the VC business, finds the best way to gain traction for his startups is to create controversy - ie anything to be talked about.
In both cases you are trying to be talked about - either by directly paying for people to do that, or by providing entertainment value so people talk about you.
You could argue that both of those activities are already being automated, and the nice thing about sales is that there's a fairly direct feedback loop you can actively learn from.
The interesting thing, though, is that the bots are just cheaper versions of real human influencers. So nothing has changed aside from scale (and speed); the underlying mechanism of paying for word of mouth is the same as it's been for a long time.
They haven't branched off into making chips themselves. They keep their focus on selling the factories.
I think they haven't, because ASML itself doesn't have production lines. Every machine is a one-off. It even gets delivered with a team of engineers to keep it running.
The same probably holds true for software factories: the best ones are assembled by the smartest people (wielding AI in ways most of us don't). They are not in the business to produce software at scale, they are in the business to ensure others can do that using increasingly advanced software factories.
This relies on the premise that such a factory cannot produce a more advanced factory without significant human intervention (e.g. high ingenuity and/or lots of elbow grease). If this doesn't hold true, then we are in for some interesting times x100.
It's very powerful and agents can create dynamic microbenchmarks and evaluate what data structure to use for optimal performance, among other things.
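A minimal sketch of the kind of microbenchmark such an agent could generate to justify a data structure choice: membership tests against a list vs. a set, timed with the standard library. The sizes and iteration counts are illustrative assumptions, not from the post.

```python
import timeit

N = 10_000
as_list = list(range(N))
as_set = set(as_list)

# Worst-case membership test: the element is at the end of the list,
# so the list scan is O(N) while the set lookup is O(1).
list_time = timeit.timeit(lambda: N - 1 in as_list, number=1_000)
set_time = timeit.timeit(lambda: N - 1 in as_set, number=1_000)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.4f}s")
```

On any typical run the set lookup is orders of magnitude faster, which is exactly the kind of evidence a "replace this data structure" suggestion can carry with it.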
I also have validation layers that trim hallucinations with handwritten linters.
I'd love to find people to network with. Right now this is a side project at work on top of writing test coverage for a factory. I don't have anyone to talk about this stuff with so it's sad when I see blog posts talking about "hype".
Would be happy to swap war stories.
<myhnusername>@gmail.com
I spend $140/mo on Anthropic + OpenAI subs and I use all my tokens all the time.
I've started spending about $100/week on API credits, but I'd like to increase that.
AI agents haven't yet figured out how to do sales, marketing or customer support in a way that people will pay money for.
Maybe that won't be necessary and instead the agent economy will be agents providing services for other agents.
Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?
I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.) And if not, whether or not you think that matters.
> Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?
I divide my work into vibecoding PoC and review. Only once I have something working do I review the code. And I do so through intense interrogation while referencing the docs.
> I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.)
Level 8 only works in production for a defined process where you don't need oversight and the final output is easy to trust.
For example, I made a code review tool that chunks a PR and assigns rule/violation combos to agents. This got a 20% time to merge reduction and catches 10x the issues as any other agent because it can pull context. And the output is easy to incorporate since I have a manager agent summarize everything.
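A hypothetical sketch of the chunking step described above: split a unified diff into per-file hunks and pair each hunk with each review rule, so every (hunk, rule) combo can be handed to its own agent. The rule names and diff are illustrative; the actual tool's internals aren't shown in the post.

```python
import itertools
import re

RULES = ["no-sql-injection", "error-handling", "naming-conventions"]  # example rules

def split_hunks(diff_text: str) -> list[str]:
    """Split a unified diff into hunks, keeping the @@ header with each hunk."""
    parts = re.split(r"(?m)^(?=@@ )", diff_text)
    return [p for p in parts if p.startswith("@@")]

def make_work_items(diff_text: str) -> list[tuple[str, str]]:
    """Cross every hunk with every rule; each pair becomes one agent's task."""
    return list(itertools.product(split_hunks(diff_text), RULES))

diff = """@@ -1,3 +1,4 @@
 def handler(q):
-    run(q)
+    run("SELECT * FROM t WHERE id=" + q)
@@ -10,2 +11,3 @@
 x = 1
+y = 2
"""
items = make_work_items(diff)
print(len(items))  # 2 hunks x 3 rules = 6 work items
```

The point of the cross product is that each agent gets a small, focused context (one hunk, one rule) and can still pull extra repository context on demand, which is where the quality gain comes from.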
Likewise, I'm working on an automatic performance tool right now that chunks code, assigns agents to make microbenchmarks, and tries to find optimization points. The end result should be easy to verify since the final suggestion would be "replace this data structure with another, here's a microbenchmark proving so".
Also would be interested in an example of "validation layers that trim hallucinations with handwritten linters" but understand if that's not something you can share. Either way, thanks for responding!
For code review, AI doesn't want to output well-formed JSON and oftentimes doesn't leave inline suggestions cleanly. So there's a step where the AI must call a script that validates the JSON and checks if applying the suggestion results in valid code, then fixes the code review comments until they do.
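A hedged sketch of that validation step: parse the model's JSON, then check that applying each inline suggestion still yields code that parses. The review-comment shape and field names (`line`, `suggestion`) are assumptions for illustration.

```python
import ast
import json

def validate_review(raw: str, source: str) -> list[str]:
    """Return a list of problems; an empty list means the review output is usable."""
    try:
        comments = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"malformed JSON: {e}"]
    problems = []
    for c in comments:
        if not {"line", "suggestion"} <= c.keys():
            problems.append(f"missing fields in {c!r}")
            continue
        lines = source.splitlines()
        if not (1 <= c["line"] <= len(lines)):
            problems.append(f"line {c['line']} out of range")
            continue
        lines[c["line"] - 1] = c["suggestion"]
        try:
            ast.parse("\n".join(lines))  # does the patched file still parse?
        except SyntaxError:
            problems.append(f"suggestion at line {c['line']} breaks the code")
    return problems

src = "x = 1\ny = x +\n"  # line 2 is broken on purpose
good = json.dumps([{"line": 2, "suggestion": "y = x + 1"}])
bad = json.dumps([{"line": 2, "suggestion": "y = x + +"}])
print(validate_review(good, src))  # []
print(validate_review(bad, src))
```

In the loop described above, any non-empty problem list would be fed back to the model to fix its own review comments until the checks pass.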
Right now, where we have humans who just sit there verifying enough nines after the decimal in the error rates, the entire levels conversation is dead. It's almost a binary state: autonomous or not.
Something similar happened with the software levels: even Level 2 was sci-fi two years ago, and a year from now anything below Level 5 will be a joke, except for heavily regulated software or billion-user-scale systems.
https://www.danshapiro.com/blog/2026/01/the-five-levels-from...
> Look at your app, describe a sequence of changes out loud, and watch them happen in front of you.
The problem, a lot of the time, is that either you don't know what you want or you can't communicate it (and usually you can't communicate it properly because you don't know exactly what you want). I think this is going to be the bottleneck very soon; for some people, it already is. I'm curious what your thoughts are on this: where do you see it going, and how do you think we can prepare for and address it? Or do you not see it as an issue?
This is increasingly untrue with Opus 4.6. Claude Max gives you enough tokens to run ~5-10 agents continuously, and I'm doing all of my work with agent teams now. Token usage is up 10x or more, but the results are infinitely better and faster. Multi-agent team orchestration will be to 2026 what agents were to 2025. Much of the OP article feels 3-6 months behind the times.
Maybe it's just me, but I don't see the appeal in verbal dictation, especially where complexity is involved. I want to think through issues deliberately, carefully, and slowly to ensure I'm not glossing over subtle nuances. I don't find speaking to be conducive to that.
For me, the process of writing (and rewriting) gives me the time, space, and structure to more precisely articulate what I want with a more heightened degree of specificity. Being able to type at 80+ wpm probably helps as well.
Stream-of-consciousness typing for me is still slower and causes me to buffer and filter more, and deliberately crafting a perfect prompt is far slower still.
LLMs are great at extracting the essence of unstructured inputs and voice lets me take best advantage of that.
Voice output, on the other hand, is completely useless unless perhaps it can play at 4x speed. But I need to be able to skim LLM output quickly and revisit important points repeatedly. Can't see why I'd ever want to serialize and slow that down.
Level 12: agent superintelligence - single entity doing everything
Level 13: agent superagent, agenting agency agentically, in a loop, recursively, mega agent, agentic agent agent agency super AGI agent
Level 14: A G E N T
Spec driven development can reduce the amount of re-implementation that is required due to requirements errors, but we need faster validation cycles. I wrote a rant about this topic: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
That's a smell for where the author and maybe even the industry is.
Agents don't have any purpose or drive like humans do; they are probabilistic machines, so eventually they are limited by the finite amount of information they carry. Maybe that's what's blocking level 8, or blocking it from working like a large human organization.
Until you build an AI oncaller to handle customer issues in the middle of the night (and, depending on your product, an AI who can be fired if customer data is corrupted/lost), no team should be willing to remove the "human reviews code" step.
For a real product with real users, stability is vastly more important than individual IC velocity. Stability is what enables TEAM velocity and user trust.
https://factory.strongdm.ai/techniques
Techniques covered in-depth + Attractor open source implementations:
https://factory.strongdm.ai/products/attractor#community
https://github.com/search?q=strongdm+attractor&type=reposito...
https://github.com/strongdm/attractor/forks
I'm continuing to study and refine my approach to leverage all this.
I spend a great deal of my time planning and assessing/reviewing through various mechanisms. I think I do codify in ways when I create a skill for any repeated assessment or planning task.
> To be clear, planning as a general practice isn't going away. It's just changing shape. For newer practitioners, plan mode remains the right entry point (as described in Levels 1 and 2). But for complex features at Level 7, "planning" looks less like writing a step-by-step outline and more like exploration: probing the codebase, prototyping options in worktrees, mapping the solution space. And increasingly, background agents are doing that exploration for you.
I mean, it's worth noting that a lot of plan modes are shaped to do the Socratic discovery before creating plans. For any user level. Advanced users probably put a great deal of effort (or thought) into guiding that process themselves.
> ralph loops (later on)
Ralph loops have been nothing but a dramatic mess for me, honestly. They disrupt the assessment process where humans are needed. Otherwise, don't expect them to go off and craft an extensive PRD without massive issues that are hard to review.
It would seem that this is a harness problem, in terms of how they keep an agent working and focused on specific tasks (relative to model capability), not something a user should have to initiate on their own.

I've experimented with agent teams. However, the current implementation (in Claude Code) burns tokens. I used one prompt to spin up a team of 9+ agents: Claude Code used about 1M output tokens. Granted, it was a very long-horizon task (it kept itself busy for almost an hour uninterrupted), but 1M+ output tokens is excessive.

What I also find is that for parallel agents, the UI is not good enough yet when you run it in the foreground. My permission management is set up so that I almost never get interrupted, but that took a lot of investment to make it that way. Most users will likely run agent teams in an unsafe fashion. From my point of view, the devex for agent teams does not really exist yet.
Imagine going back in time to when servlets and applets were the big new thing. You wouldn't want to spend your time learning those technologies, but your boss would constantly be telling you they're the future. So boring.
Also, I'm struggling to take it to the multiple-agent level, mostly because things in the project depend on each other: most changes cut across the UI, the protocol, and the server side, so it's not clear how agents would merge incompatible versions.
Verification is a tricky part as well: all tests could be passing, including end-to-end integration and visual tests, but my own verification still catches things like data not being persisted or crypto signatures not being verified.
The idea that Claude/Cursor are the new high level programming language for us to work in introduces the problem that we're not actually committing code in this "natural language", we're committing the "compiled" output of our prompting. Which leaves us reviewing the "compiled code" without seeing the inputs (eg: the plan, prompt history, rules, etc.)
The more I try to use these tools to push up this "ladder", the more it becomes clear the technology is no more than a 10x better Google search.
I let agents run wild on frontend JS because I don't know it well and trust them (and an output I can look at).
This is also where I do most of my AI use. It's the safe spot where I'm not going to accidentally send proprietary info to an unknown number of eyeballs (computer or human).
It’s also just cumbersome enough that I’m not relying on it too much and stunting my personal ability growth. But I’m way more novice than most on here.
For personal projects, I'm able to use it a bit more directly, but would say I'm using it around 5/6 level as defined here... I've leaned on it a bit for planning stages, which helps a lot... not sure I trust swarms of automated agents, though it's pretty much the only way you're going to use the $200 level on Claude effectively... I've hit the limits on the $100 only twice in the past month, I downgraded after my first month. And even then, it just forced me to take a break for an hour.
Newer models are only marginally better at ignoring the distractors; very little has actually changed, and managing the context matters just as much as it did a year ago. People building agents largely ignore that inefficiency and concentrate on higher abstraction levels, compensating with token waste (which the article also discusses).
I feel like going back to Level 5.
Level 6 helps with fixing bugs, but adding a new feature in a scalable way is not working out for me. I feed it a bunch of documents and ask it to analyze them and come up with a solution.
1. It misses some details from the docs when summarizing.
2. It misses some details from the code and its architecture, especially in multi-repo Java projects (annotations and 100-level inheritance confuse it a lot).
3. Then it comes up with an obvious (non-)"solution" based on the incorrect context summaries.
I don't think I can give full autonomy to these things yet.
But then I wonder: why don't people at Level 8 create a bunch of clones of games and SaaS products and start making billions?
Speak for yourself.
Also Level 7 is a misunderstanding of why plan mode is actually used even though one-shot works perfectly
Moving past that, I'm not sure that I really trust it... I feel that manual review of product behavior and code matters a lot. AI agents often make similar mistakes to real people in leaking abstractions or subtle mistakes with security... So I do review almost everything, at least at the level where a feature PR makes sense. Though an AI pass at that can help too.