Just talk to it – A way of agentic engineering

https://steipete.me/posts/just-talk-to-it

100•freediver•9h ago

Comments

squirrel•8h ago

I have to imagine that like pair programming, this multi-AI approach would be significantly more tiring than one-window, one-programmer coding. Do you have to force yourself to take breaks to keep up your stamina?

throw-10-13•6h ago

The human context window is the limiting factor.

That and testing/reviewing the insane amounts of ai slop this method generates.

XenophileJKO•6h ago

Well I don't use as many instances as they do, but using codex for example will take some time researching the code. It mostly just becomes a practice so that I don't have to wait for the model, I just context switch to a different one. It probably helps that I am ADHD, I don't pay much of a cost to ping pong around.

So I might tell one to look back in the git history to when something was removed and add it back into a class. So it will figure out what commit added it, what removed it, and then add the code back in.

While that terminal is doing that, on another I can kick off another agent to make some fixes for something else that I need to knock out in another project.

I just ping pong back to the first window to look at the code and tell it to add a new unit test for the new possible state inside the class it modified and I'm done.

I may also periodically while working have a question about a best practice or something that I'll kick off in browser and leave it running to read later.

This is not draining, and I keep a flow because I'm not sitting and waiting on something, they are waiting on me to context switch back.

philipp-gayret•6h ago

> But Claude Code now has Plugins

> Do you hear that noise in the distance? It’s me sigh-ing. (...) Yes, maintaining good documents for specific tasks is a good idea. I keep a big list of useful docs in a docs folder as markdown.

I'm not that familiar with Claude Code Plugins, but it looks like it allows integrations with Hooks, which is a lot more powerful than just giving more context. Context is one thing, but Hooks let you codify guardrails. For example where I work we have a setup for Claude Code that guides it through common processes, like how to work with Terraform, Git or manage dependencies and the whitelisting or recommendation towards dependencies. You can't guarantee this just by slapping on more context. With Hooks you can both auto-approve or auto-deny _and_ give back guidance when doing so, for me this is a killer feature of Claude Code that lets it act more intelligently without having to rely on it following context or polluting the context window.

Cursor recently added a feature much like Claude Code's hooks, I hope to see it in Codex too.

hansmayer•6h ago

So, just trying to understand this - he admits to code being slop and in the same sentences states that agents (which created the slop in the first place) also refactor it? Where is the logic in that?

grim_io•5h ago

He, who does not produce slop in the first iteration, cast the first stone.

fhd2•4h ago

I feel in this context, refactoring has lost its meaning a bit. Sure, it's often used analogous to changes that don't affect semantics. But originally, the idea was that you make a quick change to solve the problem / test the idea, and then spend some time on properly integrating the changes in the existing system.

LLMs struggle with simplicity in my experience, so they struggle with the first step. They also lack the sort of intelligence required to understand (let alone evolve) the system's design, so they will struggle with the second step as well.

So maybe what's meant here is not refactoring in the original meaning, but rather "cleanup". You can do it in the original way with LLMs, but that means you'll have to be incredibly micro manage-y, in my experience. Any sort of vibe coding doesn't lead to anything I'd call refactoring.

ImaCake•3h ago

> LLMs struggle with simplicity in my experience

I think a lot of this is because people (and thus LLMs) use verbosity as a signal for effort. It's a very bad signal, especially for software, but its a very popular signal. Most writing is much longer than it needs to be, everything from SEO website recipes, consulting reports, and non-fiction books. Both the author and the readers are often fooled into thinking lots of words are good.

It's probably hard to train that out of an LLM, especially if they see how that verbosity impressess the people making the purchasing decisions.

sebstefan•4h ago

"So, just trying to understand this - he admits to code being buggy and in the same sentences states that he (the engineer who created the bugs in the first place) also debugs it? Where is the logic in that?"

hansmayer•2h ago

Are you seriously comparing outputs of human intelligence to a text generator ?

sebstefan•1h ago

No I'm highlighting that the logic itself is stupid

If the point is that you can't solve with AI what you messed up with AI, but with human intelligence spending a bit more time on the problem does indeed tend to help, you need to explain why his technique with the AI won't work either.

Plus he's adding human input to it every time, so I see no reason to default to "it wouldn't work".

hansmayer•1h ago

Well you said it yourself, you are literally comparing human intelligence with the so-called AI, or better said, advanced text generator. The differentiator being, the text generators have 0 intelligence, otherwise there would not be a flood of AI gurus explaining the latest trick to making them finally work.

barrkel•4h ago

It all sounds somewhat impressive (300k lines written and maintained by AI) but it's hard to judge how well the experience transfers without seeing the code and understanding the feature set.

For example, I have some code which is a series of integrations with APIs and some data entry and web UI controls. AI does a great job, it's all pretty shallow. The more known the APIs, the better able AI is to fly through that stuff.

I have other code which is well factored and a single class does a single thing and AI can make changes just fine.

I have another chunk of code, a query language, with a tokenizer, parser, syntax tree, some optimizations, and it eventually constructs SQL. Making changes requires a lot of thought from multiple angles and I could not safely give a vague prompt and expect good results. Common patterns need to fall into optimized paths, and new constructs need consideration about how they're going to perform, and how their syntax is going to interact with other syntax. You need awareness not just of the language but also the schema and how the database optimizes based on the data distribution. AI can tinker around the edges but I can't trust it to make any interesting changes.

timr•3h ago

On the contrary, this doesn’t sound impressive at all. It sounds like a cowboy coder working on relatively small projects.

300k LOC is not particularly large, and this person’s writing and thinking (and stated workflow) is so scattered that I’m basically 100% certain that it’s a mess. I’m using all of the same models, the same tools, etc., and (importantly) reading all of the code, and I have 0% faith in any of these models to operate autonomously. Also, my opinion on the quality of GPT-5 vs Claude vs other models is wildly different.

There’s a huge disconnect between my own experience and what this person claims to be doing, and I strongly suspect that the difference is that I’m paying attention and routinely disgusted by what I see.

sarchertech•3h ago

300k especially isn’t impressive if it should have been 10k.

timr•3h ago

Yes, well put. And that’s a common failure mode.

CuriouslyC•3h ago

To be fair, codebases are bimodal, and 300k is large for the smaller part of the distribution. Large enterprise codebases tend to be monorepos, have a ton of generated code and a lot of duplicated functionality for different environments, so the 10-100 million line claims need to be taken with a grain of salt, a lot of the sub projects in them are well below 300k even if you pull in defs.

srameshc•4h ago

Everytime I read something like this, I question myself, what I am doing wrong ? And I tried all kinds of AI tools. But I am not even close to claiming that AI writes 50% of my code. My work which sometimes include feature enhancements and maintenance is where I get even less. I have to be extremely careful and make sure nothing unwanted or addition that I am unaware of has been added. Maybe it's me and I am not good yet to get to 100% AI code generation.

fhennig•3h ago

I'm in the same boat, I still find the model to make mistakes or solve things in a less than ideal way - maybe the future is to just not care - but for now I want to maintain the level of quality that the codebase currently has.

I think it's good to keep up with what early adopters are doing, but I'm not too fussed about missing something. The plugins is a good example: A few weeks ago there was a post on HN where someone said they are using 18 or 25 or whatever plugins and it's the future, now this person says they are using none. I'm still waiting for the dust to settle, I'm not in a rush.

CuriouslyC•3h ago

The person using 25 plugins is giving bad advice. The agent isn't going to need all those tools at once, and each tool burns context and causes tool confusion. Enable MCPs for the specific task you're going to have your agent do.

The trick is to create deterministic hurdles the LLM has to jump over. Tests, linting, benchmarks, etc. You can even do this with diff size to enforce simpler code, tell an agent to develop a feature and keep the character count of the diff below some threshold, and it'll iterate on pruning the solution.

troupo•3h ago

You're not doing anything wrong. You have to read past hyperbole.

Here's how the article starts: "Agentic engineering has become so good that it now writes pretty much 100% of my code. And yet I see so many folks trying to solve issues and generating these elaborated charades instead of getting sh*t done."

Here's how it continues:

- I run between 3-8 in parallel

- My agents do git atomic commits, I iterated a lot on the agents file: https://gist.github.com/steipete/d3b9db3fa8eb1d1a692b7656217...

- I currently have 4 OpenAI subs and 1 Anthropic sub, so my overall costs are around 1k/month for basically unlimited tokens.

- My current approach is usually that I start a discussion with codex, I paste in some websites, some ideas, ask it to read code, and we flesh out a new feature together.

- If you do a bigger refactor, codex often stops with a mid-work reply. Queue up continue messages if you wanna go away and just see it done

- When things get hard, prompting and adding some trigger words like “take your time” “comprehensive” “read all code that could be related” “create possible hypothesis” makes codex solve even the trickiest problems.

- My Agent file is currently ~800 lines long and feels like a collection of organizational scar tissue. I didn’t write it, codex did.

It's the same magical incantations and elaborated charades as everyone does. The "the no-bs Way of Agentic Engineering" is full of bs and has nothing concrete except a single link to a bunch of incantations for agents. No idea what his actual "website + tauri app + mobile app" is that he build 100% with AI, but depending on actual functionality, after burning $1000 a month on tokens you may actually have a fully functioning app in React + Typescript with little human supervision.

1718627440•3h ago

> $1000 a month

Yeah at this point you could hire a software developer.

TheMrZZ•3h ago

I don't know how much SWE get paid in your area, but I sure hope it's not 1000$/month.

Though I'm aligned that I don't (yet) believe in this "AI writes all my code for me" statements.

1718627440•3h ago

It includes that with AI you still need someone to work. First to query the AI and then to fix up something and to bring it in a form you can release and use.

stocksinsmocks•28m ago

$5.75/hr is well below outsourced rates. It’s $1.40/hr if the agent runs without stopping. If I hired a human consultant for a project of any size, I could easily spend $10,000 or more on just scoping and contract approval. Humans don’t win on cost.

tptacek•1h ago

Same! Half would be a lot for me. I'm also not close to the point where I'm comfortable merging LLM-authored PRs without line-by-line reviews.

sarchertech•3h ago

Another person building zero stakes tools for AI coding. 300k LOC also isn’t impressive if it should have been 10k.

N_Lens•3h ago

I'm currently satisfied with Claude Code, but this article seems to sing the praises of Codex. I am dubious Whether it's actually superior or it's 'organic marketing' by OAI (Given that they undoubtedly do this, and other shady practices).

I'll give codex a try later to compare.

esafak•1h ago

Codex is about as capable as Sonnet, but slower. One advantage is that it more readily pushes back against requests, like the article noted.

dinkleberg•40m ago

I’ve found codex to be a much more thorough code reviewer.

Recently I’ve been using Claude for code gen and codex for review.

I keep trying to use Gemini as it is so fast, but it is far inferior in every other way in my experience.

CuriouslyC•3h ago

Use of the bell curve for this meme considered harmful.

If you're going to use AI like that, it's not a clear win over writing the code yourself (unless you're a mid programmer). The whole point of AI is to automate shit, but you've planted a flag on the minimal level of automation you're comfortable with and proclaimed a pareto frontier that doesn't exist.

mherrmann•3h ago

I use Claude Code every day and find that it still requires a lot of hand-holding. Maybe codex is better. But just in my last session today, Claude wrote 100 lines of test code that could have been 20, and 30 lines of production code that could have been 5. I'm glad I do not have to maintain 300 kloc of 100% AI-generated code. But at the end of the day, what counts is velocity and quality, and it seems OP is happy. The tools certainly are useful.

cruffle_duffle•3h ago

Have any of these no-bs articles described how they handle schema changes? Are they even using a “real” DB or is it all local, single user sqllite? Because I can see a disaster looming letting a vibe coder’s agent loose on a database.

And does it require auth? How is that spec’d out and validated? What about RBAC or anything? How would you even get the LLM to constantly follow rules for that?

Don’t get me wrong these tools are pretty cool but the old adage “if it sounds too good to be true, it probably is” always applies.

lmeyerov•30m ago

Senior engineers know process, including for what you described, and that maps to plan-driven AI engineering well:

1. Note the discussion of plan-driven development in the claude code sections (think: plan = granular task list, including goals & validation criteria, that the agent loops over and self-modifies). Plans are typically AI generated: I ask it to do initial steps of researching current patterns for x+y+z and include those in the steps and validations, and even have it re-audit a plan. Codex internally works the same, and multiple people are reporting it automates more of this plan flow.

2. Working with database for tasks like migrations is normal and even better. My two UIs are now the agent CLI (basically streaming AI chat for task list monitoring & editing) and GitHub PR viewer: if it wasn't smart enough to add and test migrations and you didn't put that into the plan, you see it in the PR review and tell it to fix that. Writing migrations is easy, but testing them is annoying, and I've found AI helping write mocks, integration tests, etc to be wonderful.

pqdbr•3h ago

One more article praising Codex CLI over Claude Code. Decided to give it a try this morning.

A simple task that would have taken literally no more than 2 minutes in Claude Code is, as of now, 9m+ and still "inspecting specific directory", with an ever increasing list of read files, not a single line of code written.

I might be holding it wrong.

kendallchuang•2h ago

What I understand is Codex takes more time to gather context to make the relevant changes; with more context it may give a more precise response than Claude Code.

pqdbr•2h ago

With Claude Code, I can 'tune' the prompt by feeling how much context the model needs to perform it's task. I can mention more files or tell it to read more code as needed.

With one hour of experience of Codex CLI, every single prompt - even the most simple ones - are 5+ minutes of investigation before anything gets done. Unbearable and totally unnecessary.

esafak•2h ago

It has the same context. It does not take that long to ingest a bunch of files. OpenAI is just not offering the same level of performance, probably due to oversubscription.

sarchertech•2h ago

I’m not making any psychiatric diagnoses based on GitHub repos or YouTube videos.

But. Sometimes when I see someone talking about cranking out hundreds of thousands of lines of vibe coded apps, I go watch their YouTube videos, or checkout their dozens of unconnected, half finished repos.

Every single time I get a serious manic vibe.

cruffle_duffle•44m ago

The overlap between crypto hustlers and AI hustlers is pretty interesting. Not strictly “hustle both” type overlap but it’s a similar type of energy. Bully the non-believers and hype the hype regardless of reality.

I dunno. People say these tools trigger the gambling part of your brain. I think there is a lot of merit to that. When these tools work (which they absolutely do) it’s incredible and your brain gets a nice hit of dopamine but holy cow can these tools fail. But if you just keep pulling that lever, keep adding “the right” context and keep casting the right “spells” the AI will perform its magic again and you’ll get your next fix. Just keep at it. Eventually you’ll get it.

Surely somebody somewhere is doing brain imagery when using these tools. I wouldn’t be surprised to see the same parts of the brain light up as when you play something like Candy Crush. Dig deep into the sunk cost fallacy, pepper with an illusion of control and that glorious “I’m on a roll” feeling (how many agents did this dude have active at once?) and boom…

I mean read the post. The dude spends $1000/mo plugging tokens into a grid of 8 parallel agents. They have a term for this in the gaming industry. It’s a whale.

bigblind•2h ago

I'd love to be able to watch people work, who say that they're sucessful with these tools. If there are any devs live streaming software development on Twitch, or just making screen casts (without too many cuts) of how they use these tools in day-to-day work, I'd love to see it.

simonw•2h ago

Armin Ronacher has done some of those: https://www.youtube.com/@ArminRonacher

aredox•2h ago

>With Claude Code I often have multi-second freezes and it’s process blows up to gigabytes of memory.

I am in a mood where I find it excessively funny that, all that talk about AI, agents, billions of dollars, tera-watts/-hours spent, and people still manage to publish posts with the "its/it's" mistake.

(I am not a native English speaker, so I notice it at a higher rate than people who learned English "by ear".)

Maybe you don't care or you find it annoying to have it pointed out, but it says something about fundamentals. You know, "The way you do one thing is the way you do all things".

WA•1h ago

I get what you're saying. "they're" and "their" is also a classic that many non-natives seem to get right, but native speakers have trouble with.

But OP isn't native either. He's Austrian.

codyb•1h ago

I'm very confused.

In the picture right at the top of the article, the top of the bell curve is using 8 agents in parallel, and yada yada yada.

And then they go on to talk about how they're using 9 agents in parallel at a cost of 1000 dollars a month for a 300k line (personal?) project?

I dunno, this just feels like as much effort as actually learning how to write the code yourself and then just doing it, except, at the end... all you have is skills for tuning models that constantly change under you.

And it costs you 1000 dollars a month for this experience?

woeirua•49m ago

300k lines of AIslop that probably would've been 20k lines of code from a human. And the human has no ability to maintain it, or to reason about it.

esafak•1h ago

I wonder how many lines of code he generates and reviews a day. With five subscriptions I do not think it is possible to read it all. You can generate more code than you can read with just one subscription.

outside1234•36m ago

Is this paid content from OpenAI?

Apple M5 chip

You are the scariest monster in the woods

Pwning the Entire Nix Ecosystem

M5 MacBook Pro

Show HN: Halloy – the modern IRC client I hope will outlive me

Mac Source Ports – Run old games on new Macs

Something is broken with the way we measure success on the internet

FSF announces Librephone project

Show HN: Scriber Pro – Offline AI transcription for macOS

iPad Pro with M5 chip

I almost got hacked by a 'job interview'

Apple Vision Pro upgraded with M5 chip

Helpcare AI (YC F24) Is Hiring

Ireland is making basic income for artists program permanent

Garbage Collection for Rust: The Finalizer Frontier

Reverse Engineering a 27MHz RC Toy Communication Using RTL SDR

Pixnapping Attack

Oops It's a kernel stack use-after-free: Exploiting Nvidia's GPU Linux drivers

Flapping-wing robot achieves self-takeoff by adopting reconfigurable mechanisms

We Raised $5.7M to Launch Cto.new Completely for Free

Leaving serverless led to performance improvement and a simplified architecture

ASP.NET Security Feature Bypass Vulnerability

OVM6948 Miniature Camera Module [pdf]

Show HN: Firm, a text-based work management system

DOJ seizes $15B in Bitcoin from 'pig butchering' scam based in Cambodia

Europe's Digital Sovereignty Paradox – "Chat Control" Update

Beliefs that are true for regular software but false when applied to AI

Show HN: Trott – search,sort,extract social media videos(ig,yt,tiktok)

HRM Analysis by Arc Prize Organizers

Just talk to it – A way of agentic engineering

Apple M5 chip

You are the scariest monster in the woods

Pwning the Entire Nix Ecosystem

M5 MacBook Pro

Show HN: Halloy – the modern IRC client I hope will outlive me

Mac Source Ports – Run old games on new Macs

Something is broken with the way we measure success on the internet

FSF announces Librephone project

Show HN: Scriber Pro – Offline AI transcription for macOS

iPad Pro with M5 chip

I almost got hacked by a 'job interview'

Apple Vision Pro upgraded with M5 chip

Helpcare AI (YC F24) Is Hiring

Ireland is making basic income for artists program permanent

Garbage Collection for Rust: The Finalizer Frontier

Reverse Engineering a 27MHz RC Toy Communication Using RTL SDR

Pixnapping Attack

Oops It's a kernel stack use-after-free: Exploiting Nvidia's GPU Linux drivers

Flapping-wing robot achieves self-takeoff by adopting reconfigurable mechanisms

We Raised $5.7M to Launch Cto.new Completely for Free

Leaving serverless led to performance improvement and a simplified architecture

ASP.NET Security Feature Bypass Vulnerability

OVM6948 Miniature Camera Module [pdf]

Show HN: Firm, a text-based work management system

DOJ seizes $15B in Bitcoin from 'pig butchering' scam based in Cambodia

Europe's Digital Sovereignty Paradox – "Chat Control" Update

Beliefs that are true for regular software but false when applied to AI

Show HN: Trott – search,sort,extract social media videos(ig,yt,tiktok)

HRM Analysis by Arc Prize Organizers

Just talk to it – A way of agentic engineering

Just talk to it – A way of agentic engineering

Comments