The quality of AI-assisted software depends on unit of work management

https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/

53•mogambo1•1h ago

Comments

datadrivenangel•1h ago

Keep your scope as small as necessary, but no smaller. This has been fundamentally true for project management work breakdown structures for decades.

liszper•59m ago

most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today

thanks for the article, it's a good one

TheRoque•46m ago

It's true that I haven't been a hardcore agent-army vibe coder, I just try the popular ones once in a while in a naive way (isn't it the point of these tools, to have little friction?), claude code for example. And it's cool! But imperfect, and as this article attests, there's a lot of mental overhead to even have a shot at getting a decent output. And even if it's decent, it still needs to be reviewed and could include logical flaws.

I'd rather use it the other way, I'm the one in charge, and the AI reviews any logical flaw or things that I would have missed. I don't even have to think about context window since it'll only look at my new code logic.

So yeah, 3 years after the first ChatGPT and Copilot, I don't feel huge changes regarding "automated" AI programming, and I don't have any AI tool in my IDE. I prefer to have a chat using their website, to brainstorm, or occasionally find a solution to something I'm stuck on.

blibble•41m ago

> most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today

yes, just as was said each and every previous time OpenAI/anthropic shit out a new model

"now it doesn't suck!"

Filligree•19m ago

Each and every new model expands the scope of what you can do. You notice that, get elated when things that didn’t work start working, then three weeks later the honeymoon period is over and you notice the remaining limits.

The hedonic treadmill ensures it feels the same way each time.

But that doesn’t mean the models aren’t improving, nor that the scope isn’t expanding. If you compare today’s tools to those a year ago, the difference is stark.

zeroonetwothree•41m ago

I use agents for coding small stuff at work almost every day. I would say there has been some improvement compared to a year ago but it’s not any sort of step change. They still are only able to complete simple “intern-level” tasks around 50% of the time. Which is helpful but not revolutionary.

angusturner•25m ago

I think most SWEs do have a good idea where I work.

They know that it's a significant, but not revolutionary, improvement.

If you supervise and manage your agents closely on well scoped (small) tasks they are pretty handy.

If you need a prototype and don't care about code quality or maintenance, they are great.

Anyone claiming 2x, 5x, 10x etc is absolutely kidding themselves for any non-trivial software.

liszper•20m ago

I'd argue this just proves my point.

kibwen•22m ago

Last week I wanted to generate some test data for some unit tests for a certain function in a C codebase. It's an audio codec library, so I could have modified the function to dump its inputs to disk and then run the library on any audio file and then hardcoded the input into the unit tests. Instead, I decided I wanted to save a few bytes and wanted to look at generating dummy data dynamically. I wanted to try out Claude for generating the code that would generate the data, so to keep the context manageable I extracted the function and all its dependencies into a self-contained C program (less than 200 lines altogether) and asked it to write a function that would generate dummy data, in C.

Impressively, it recognized the structure of the code and correctly identified it as a component of an audio codec library, and provided a reasonably complete description of many minute details specific to this codec and the work that the function was doing.

Rather less impressively, it decided to ignore my request and write a function that used C++ features throughout, such as type inference and lambdas, or should I say "lambdas" because it was actually just a function-defined-within-a-function that tried to access and mutate variables outside of its own function scope, like we were writing Javascript or something.

I can see why people would be wowed by this on its face. I wouldn't expect any average developer to have such a depth of knowledge and breadth of pattern-matching ability to be able to identify the specific task that this specific function in this specific audio codec was performing.

At the same time, this is clearly not a tool that's suitable for letting loose on a codebase without EXTREME supervision.

At the end of the day, I got the code working by editing it manually, but in an honest retrospective I would have to admit that the overall process actually didn't save me any time at all.

angusturner•16m ago

I feel this. I've had a few tasks now where in honest retrospect I find myself asking "did that really speed me up?". It's a bit demoralising cause not only do you waste time, you have a worse mental model of the resulting code and feel less sense of ownership over the result.

Brainstorming, ideation and small, well defined tasks where I can quickly vet the solution : these feel like the sweet spot for current frontier model capabilities.

(Unless you are pumping out some sloppy React SPA where you don't care about anything except getting it working as fast as possible - fine, get Claude Code to one-shot it)

Filligree•15m ago

There’s been a lot of noise about Claude performance degradation, and the current best option is probably Codex, but this still surprises me. It sounds like it succeeded on the hard part, then stumbled on the easy bit.

Just two questions, if you don’t mind satisfying my curiosity.

- Did you tell it to write C? Or better yet, what was the prompt? You can use claude --resume to easily find that.

- Which model (Sonnet or Opus)? Though I’d have expected either one to work.

rco8786•10m ago

I still use Claude Code and Cursor and tbh still run into a lot of the same issues. Hallucinating code, hallucinating requirements, even when scoped to a very simple "make this small change".

It's good enough that it helps, particularly in areas or languages that I'm unfamiliar with. But I'm constantly fighting with it.

jonstewart•56m ago

I first tried getting specific with Claude Code. I made the Claude.md, I detailed how to do TDD, what steps it should take, the commands it should run. It was imperfect. Then I had it plan (think hard) and write the plan to a file. I’d clear context, have it read the plan, ask me questions, and then have it decompose the plan into a detailed plan of discrete tasks. Have it work its way through that. It would inevitably go sideways halfway through, even clearing context between each task. It wouldn’t run tests, it would commit breakage, it would flip flop between two different broken approaches, it was just awful. Now I’ve just been vibing, writing as little as possible and seeing what happens. That sucks, too.

It’s amazing at reviewing code. It will identify what you fear, the horrors that lie within the codebase, and it’ll bring them out into the sunlight and give you a 7 step plan for fixing them. And the coding model is good, it can write a function. But it can’t follow a plan worth shit. And if I have to be extremely detailed at the function by function level, then I should be in the editor coding. Claude code is an amazing niche tool for code reviews and dialogue and debugging and coping with new technologies and tools, but it is not a productivity enhancement for daily coding.

liszper•52m ago

With all due respect, you sound like someone who is just getting familiar with these tools. 100 more hours spent with AI coding and you will be much more productive. Coding with AI is a slightly different skill from coding, similar to how managing software engineers is different from writing software.

TheRoque•45m ago

Then, it's the job of someone else to use these tools, not developers

liszper•31m ago

I agree with your point. I think this is the reason why most developers still don't get it, because AI coding ultimately requires a "higher level" methodology.

dgfitz•23m ago

"Hacker culture never took root in the 'AI' gold rush because the LLM 'coders' saw themselves not as hackers and explorers, but as temporarily understaffed middle-managers." [0]

This, this is you. This is the entire charade. It seems poetic somehow.

[0] https://news.ycombinator.com/item?id=45123094

liszper•4m ago

I see myself as a hacker.

abtinf•45m ago

liszper:

> most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today

Also liszper: oh, you tried the current approach and don’t agree with me? Well you just don’t know what you are doing.

pjc50•28m ago

Funnily enough the same kind of approach you get from Lisp advocates and the more annoying faction of Linux advocacy (which isn't as prevalent these days, it seems)

liszper•25m ago

I'm also a lisper, yes.

liszper•26m ago

Yes, exactly. Learning new things is hard. Personally it took me about 200 hours to get started, and since then ~2500 hours to get familiar with the advanced techniques, and now I'm very happy with the results, managing extremely large codebases with LLM in production.

For context before that I had ~15 years of experience coding the traditional way.

sarchertech•22m ago

How many users is "production", and how large is "extremely large"?

liszper•16m ago

200k DAU, 7 million registered, ~50 microservices, large monorepo

sarchertech•12m ago

You have 50 microservices for 200k daily users?

Let me guess this has something to do with AI?
