https://github.com/srid/AI/blob/master/commands/plan.md#2-pl...
It works very similarly to Antigravity's plan document comment-refine cycle.
There's no winner for "least amount of code written regardless of productivity outcomes.", except for maybe Anthropic's bank account.
That said, if you're on a serious team writing professional software there is still tons of value in always telling AI to plan first, unless it's a small quick task. This post just takes it a few steps further and formalizes it.
I find Cursor works much more reliably using plan mode, reviewing/revising output in markdown, then pressing build. That isn't a ton of overhead, but it definitely adds time and often leads to a lot of context switching.
Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.
It took maybe 5-10 minutes of wall-time to come up with a good plan, and then ~20-30 min for Claude to implement, test, etc.
That would've taken me at least a day, maybe two. I had 4-5 other tasks going on in other tabs while I waited the 20-30 min for Claude to generate the feature.
After Claude generated, I needed to manually test that it worked, and it did. I then needed to review the code before making a PR. In all, maybe 30-45 minutes of my actual time to add a small feature.
All I can really say is... are you sure you're using it right? Have you _really_ invested time into learning how to use AI tools?
Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. I've reproduced some tools that took me weeks at full-time roles in a single day. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer.
You could've been curious and asked why it would take 1-2 days, and I would've happily told you.
I wanted to add audit logging for all endpoints we call, all places we call the DB, etc. across areas I haven't touched before. It would have taken me a while to track down all of the touchpoints.
Granted, I am not 100% certain that Claude didn't miss anything. I feel fairly confident that it is correct given that I had it research upfront, had multiple agents review, and it made the correct changes in the areas that I knew.
Also I'm realizing I didn't mention it included an API + UI for viewing events w/ pretty deltas
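For anyone wondering what a "delta" in an audit event might look like, here is a minimal sketch. It assumes entities can be represented as plain dicts; the `entity_delta` name and the `old`/`new` shape are hypothetical and not the commenter's actual implementation.

```python
from typing import Any


def entity_delta(before: dict[str, Any], after: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Return {field: {"old": ..., "new": ...}} for every field whose value changed.

    Hypothetical helper: a field missing from one side is reported as None.
    """
    delta: dict[str, dict[str, Any]] = {}
    # Walk the union of field names so added and removed fields are included.
    for field in sorted(before.keys() | after.keys()):
        old, new = before.get(field), after.get(field)
        if old != new:
            delta[field] = {"old": old, "new": new}
    return delta
```

An audit event could then store this delta alongside who made the change and when, which is roughly what a UI would need to render a before/after view.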
Some things are complex.
The problem is LLMs are great at simple implementation, even large amounts of simple implementation, but I've never seen it develop something more than trivial correctly. The larger problem is it's very often subtly but hugely wrong. It makes bad architecture decisions, it breaks things in pursuit of fixing or implementing other things. You can tell it has no concept of the "right" way to implement something. It very obviously lacks the "senior developer insight".
Maybe you can resolve some of these with large amounts of planning or specs, but that's the point of my original comment - at what point is it easier/faster/better to just write the code yourself? You don't get a prize for writing the least amount of code when you're just writing specs instead.
This is 100% incorrect, but the real issue is that the people who are using these llms for non-trivial work tend to be extremely secretive about it.
For example, I view my use of LLMs to be a competitive advantage and I will hold on to this for as long as possible.
Does it write maintainable code? Does it write extensible code? Does it write secure code? Does it write performant code?
My experience has been it failing most of these. The code might "work", but it's not good for anything more than trivial, well defined functions (that probably appeared in its training data written by humans). LLMs have a fundamental lack of understanding of what they're doing, and it's obvious when you look at the finer points of the outcomes.
That said, I'm sure you could write detailed enough specs and provide enough examples to resolve these issues, but that's the point of my original comment - if you're just writing specs instead of code you're not gaining anything.
The compounding is much greater than my brain can do on its own.
But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.
Specs are worth it IMO. Not because if I can spec, I could’ve coded anyway. But because I gain all the insight and capabilities of AI, while minimizing the gotchas and edge failures.
But did you truly think about such a feature? Like the guarantees it should follow (e.g. how it should cope with entity migrations such as adding a new field) or what the cost of maintaining it will be further down the line. This looks suspiciously like a drive-by PR made on open-source projects.
> That would've taken me at least a day, maybe two.
I think those two days would have been filled with research, comparing alternatives, questions like "can we extract this feature from framework X?", discussing ownership and sharing knowledge... Jumping straight into coding was done before LLMs too, but it usually hurts the long-term viability of the project.
Adding code to a project can be done quite fast (hackathons, ...), ensuring quality is what slows things down in any well-functioning team.
I find the best way to use agents (and I don't use claude) is to hash it out like I'm about to write these changes and I make my own mental notes, and get the agent to execute on it.
Agents don't get tired, they don't start fat fingering stuff at 4pm, the quality doesn't suffer. And they can be parallelised.
Finally, this allows me to stay at a higher level and not get bogged down by "right oh did we do this simple thing again?" which wipes some of the context in my mind and gets tiring through the day.
Always, 100% review every line of code written by an agent though. I do not condone committing code you don't 'own'.
I'll never agree with a job that forces developers to use 'AI', I sometimes like to write everything by hand. But having this tool available is also very powerful.
This new version that I'm doing (from scratch with ChatGPT web) has a far more ambitious scope and is already at the "usable" point. Now I'm primarily solidifying things and increasing test coverage. And I've tested the key parts with IRL scenarios to validate that it's not just passing tests; the thing actually fulfills its intended function so far. Given the increased scope, I'm guessing it'd take me a few months to get to this point on my own, instead of under a week, and the quality wouldn't be where it is. Not saying I haven't had to wrangle with ChatGPT on a few bugs, but after a decent initial planning phase, my prompts now are primarily "Do it"s and "Continue"s. Would've likely already finished it if I wasn't copying things back and forth between browser and editor, and being forced to pause when I hit the message limit.
I recommend trying out Opencode with this approach, you might find it less tiring than ChatGPT web (yes it works with your ChatGPT Plus sub).
https://github.com/backnotprop/plannotator Plannotator does this really effectively and natively through hooks
Really nice ui based on the demo.
This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.
Now? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running
(1 Outside of some core ML developers at the big model companies)
thats hilarious. i definitely treat claude like shit and ive noticed the falloff in results.
if there's a source for that i'd love to read about it.
See, uhhh, https://pmc.ncbi.nlm.nih.gov/articles/PMC8052213/ and maybe have a shot at running claude while playing Enya albums on loop.
/s (??)
I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.
Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking will see them over-analyze, and then repeatedly second-guess themselves, repeatedly re-derive conclusions.
Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.
think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.
caveat though, thats nice per-token, but the signal gets messed up by picking a token from a distribution, so each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region that you want to be makes it less likely that those distortions will kick it out of the basin or valley you may want to end up in.
if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.
or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
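The "1000 probabilistic filters" intuition can be put into numbers with a toy model. This sketch assumes each sampled token independently keeps the generation "in the basin" with a fixed probability `p_stay`. That is a deliberate simplification for illustration, not a claim about how real models behave, and `survival_probability` is a made-up name.

```python
def survival_probability(p_stay: float, n_tokens: int) -> float:
    """Chance of staying on track for n_tokens in the toy model.

    Assumption: every token independently stays on track with probability p_stay.
    """
    if not 0.0 <= p_stay <= 1.0:
        raise ValueError("p_stay must be between 0 and 1")
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return p_stay ** n_tokens
```

Under this toy model, a per-token probability of 0.999 leaves roughly a 37% chance of surviving 1000 tokens, while 0.9999 leaves roughly 90%. Small per-token improvements compound, which is one way to read the "drop the ball deep in the basin" advice.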
- You are a Python Developer... or
- You are a Professional Python Developer... or
- You are one of the world's most renowned Python experts, with several books written on the subject, and 15 years of experience in creating highly reliable production-quality code...
You will notice a clear improvement in the quality of the generated artifacts.
Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.
1. Use brainstorming to come up with the plan using the Socratic method
2. Write a high level design plan to file
3. I review the design plan
4. Write an implementation plan to file. We've already discussed this in detail, so usually it just needs skimming.
5. Use the worktree skill with subagent driven development skill
6. Agent does the work using subagents that for each task:
a. Implements the task
b. Spec reviews the completed task
c. Code reviews the completed task
7. When all tasks complete: create a PR for me to review
8. Go back to the agent with any comments
9. If finished, delete the plan files and merge the PR
The key insight here - that planning and execution should be distinct phases - applies to productivity tools too. I've been using www.dozy.site which takes a similar philosophy: it has smart calendar scheduling that automatically fills your empty time slots with planned tasks. The planning happens first (you define your tasks and projects), then the execution is automated (tasks get scheduled into your calendar gaps).
The parallel is interesting: just like you don't want Claude writing code before the plan is solid, you don't want to manually schedule tasks before you've properly planned what needs to be done. The separation prevents wasted effort and context switching.
The annotation cycle you describe (plan -> review -> annotate -> refine) is exactly how I work with my task lists too. Define the work, review it, adjust priorities and dependencies, then let the system handle the scheduling.
- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.
- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.
- Working memory files: I use both a status.md (3-5 items per phase, roughly 30 lines overall), which points to the latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean.
- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy, the role of the status files, and to review everything in the pertinent areas of code and give me a handful of high-level next set of features to address based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.
- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.
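The 300-line cap on spec files is easy to check mechanically. A minimal sketch, assuming the specs are Markdown files sitting in a single directory; the `oversized_specs` name and the flat layout are illustrative assumptions, not part of the setup described above.

```python
from pathlib import Path


def oversized_specs(spec_dir: str, max_lines: int = 300) -> list[tuple[str, int]]:
    """Return (file name, line count) for every .md file longer than max_lines.

    Hypothetical helper: only looks at *.md files directly inside spec_dir.
    """
    results: list[tuple[str, int]] = []
    for path in sorted(Path(spec_dir).glob("*.md")):
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > max_lines:
            results.append((path.name, line_count))
    return results
```

Something like this could run before a planning session so that an overgrown spec gets split before it bloats the context.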
I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.
And I'm only on the $20 plans for Codex/Gemini vs. spending $100/month on CC for the half year prior, and I move quicker w/ no stall outs due to token consumption, which was regularly happening w/ CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is flipped against what I experienced with CC and only using planning mode.
Speckit is worth trying as it automates what is being described here, and with Opus 4.6 it's been a kind of BC/AD moment for me.
However, there is a caveat. LLMs resist ambiguity about authority. So the "PCL" or whatever you want to call it, needs to be the ONE authoritative place for everything. If you have the same stuff in 3 different files, it won't work nearly as well.
Bonus Tip: I find long prompt input with example code fragments and thoughtful descriptions work best at getting an LLM to produce good output. But there will always be holes (resource leaks, vulnerabilities, concurrency flaws, etc). So then I update my original prompt input (keep it in a separate file PROMPT.txt as a scratch pad) to add context about those things maybe asking questions along the way to figure out how to fix the holes. Then I /rewind back to the prompt and re-enter the updated prompt. This feedback loop advances the conversation without expending tokens.
This has changed in the last week, for 3 reasons:
1. Claude opus. It’s the first model where I haven’t had to spend more time correcting things than it would’ve taken me to just do it myself. The problem is that opus chews through tokens, which led to..
2. I upgraded my Claude plan. Previously on the regular plan I’d get about 20 mins of time before running out of tokens for the session and then needing to wait a few hours to use again. It was fine for little scripts or toy apps but not feasible for the regular dev work I do. So I upgraded to 5x. This now got me 1-2 hours per session before tokens expired. Which was better but still a frustration. Wincing at the price, I upgraded again to the 20x plan and this was the next game changer. I had plenty of spare tokens per session and at that price it felt like they were being wasted - so I ramped up my usage. Following a similar process as OP but with a plans directory with subdirectories for backlog, active and complete plans, and skills with strict rules for planning, implementing and completing plans, I now have 5-6 projects on the go. While I’m planning a feature on one the others are implementing. The strict plans and controls keep them on track and I have follow up skills for auditing quality and performance. I still haven’t hit token limits for a session but I’ve almost hit my token limit for the week so I feel like I’m getting my money’s worth. In that sense spending more has forced me to figure out how to use more.
3. The final piece of the puzzle is using opencode over claude code. I’m not sure why but I just don’t gel with Claude code. Maybe it’s all the sautéing and flibertygibbering, maybe it’s all the permission asking, maybe it’s that it doesn’t show what it’s doing as much as opencode. Whatever it is it just doesn’t work well for me. Opencode on the other hand is great. It shows what it’s doing and how it’s thinking which makes it easy for me to spot when it’s going off track and correct early.
Having a detailed plan, and correcting and iterating on the plan is essential. Making Claude follow the plan is also essential - but there’s a line. Too fine grained and it’s not as creative at solving problems. Too loose/high level and it makes bad choices and goes in the wrong direction.
Is it actually making me more productive? I think it is but I’m only a week in. I’ve decided to give myself a month to see how it all works out.
I don’t intend to keep paying for the 20x plan unless I can see a path to using it to earn me at least as much back.
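A minimal sketch of the backlog/active/complete plan layout mentioned in point 2. The state names come from the comment; the `move_plan` helper and the rule that a plan only ever moves one state forward are assumptions made for illustration, since the actual skills and rules are not shown in the thread.

```python
import shutil
from pathlib import Path

# Lifecycle order taken from the comment: backlog, then active, then complete.
STATES = ("backlog", "active", "complete")


def move_plan(root: str, name: str, src: str, dst: str) -> Path:
    """Move <root>/<src>/<name> to <root>/<dst>/<name>.

    Hypothetical rule: a plan may only advance to the next state.
    """
    if src not in STATES or dst not in STATES:
        raise ValueError(f"unknown state: {src!r} or {dst!r}")
    if STATES.index(dst) != STATES.index(src) + 1:
        raise ValueError(f"cannot move a plan from {src} to {dst}")
    source = Path(root) / src / name
    target_dir = Path(root) / dst
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.move returns the destination path as a string.
    return Path(shutil.move(str(source), str(target_dir / name)))
```

Keeping state in the directory name means an agent (or a person) can see what is in flight with a plain directory listing.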
I burned through $10 on Claude in less than an hour. I only have $36 a day at $800 a month (800/22 working days)
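Spelling out that arithmetic as a small sketch: the $800/month and 22 working days are from the comment, while the $10/hour burn rate is an assumption derived from "$10 in less than an hour", so the real rate was at least that high. The function names are made up.

```python
def daily_budget(monthly_budget: float, working_days: int) -> float:
    """Budget available per working day."""
    if working_days <= 0:
        raise ValueError("working_days must be positive")
    return monthly_budget / working_days


def hours_covered(daily: float, burn_per_hour: float) -> float:
    """Hours of usage a daily budget covers at a constant hourly burn rate."""
    if burn_per_hour <= 0:
        raise ValueError("burn_per_hour must be positive")
    return daily / burn_per_hour
```

With these numbers, `daily_budget(800, 22)` is about 36.36, and at $10/hour that covers about 3.6 hours of use per working day at most.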
It doesn’t seem controversial that the model that can solve more complex problems (that you admit the cheaper model can’t solve) costs more.
For the things I use it for, I’ve not found any other model to be worth it.
> ...
> never let Claude write code until you’ve reviewed and approved a written plan
I certainly always work towards an approved plan before I let it loose on changing the code. I just assumed most people did, honestly. Admittedly, sometimes there's "phases" to the implementation (because some parts can be figured out later and it's more important to get the key parts up and running first), but each phase gets a full, reviewed plan before I tell it to go.
In fact, I just finished writing a command and instruction to tell claude, when it presents a plan for implementation, to offer me another option: writing out the current (important parts of the) context and the full plan to individual (ticket-specific) md files. That way, if something goes wrong with the implementation I can tell it to read those files and "start from where they left off" in the planning.
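A minimal sketch of what "write the context and plan to ticket-specific md files" could look like on disk. In practice the agent writes these files itself; the `save_plan_snapshot` helper, the `plans` directory, and the `<ticket>-context.md` / `<ticket>-plan.md` naming are all hypothetical.

```python
from pathlib import Path


def save_plan_snapshot(
    ticket_id: str, context: str, plan: str, out_dir: str = "plans"
) -> tuple[Path, Path]:
    """Write the context and plan for one ticket to two Markdown files.

    Hypothetical naming: <out_dir>/<ticket_id>-context.md and <ticket_id>-plan.md.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    context_path = directory / f"{ticket_id}-context.md"
    plan_path = directory / f"{ticket_id}-plan.md"
    context_path.write_text(context, encoding="utf-8")
    plan_path.write_text(plan, encoding="utf-8")
    return context_path, plan_path
```

A fresh session can then be pointed at those two files to resume from the approved plan rather than from an empty context.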
We all tend to regress to average (same thoughts/workflows)...
Have had many users already doing the exact same workflow with: https://github.com/backnotprop/plannotator
I craft a detailed and ordered set of lecture notes in a Quarto file and then have a dedicated claude code skill for translating those notes into Slidev slides, in the style that I like.
Once that's done, much like the author, I go through the slides and make commented annotations like "this should be broken into two slides" or "this should be a side-by-side" or "use your generate clipart skill to throw an image here alongside these bullets" and "pull in the code example from ../examples/foo." It works brilliantly.
And then I do one final pass of tweaking after that's done.
But yeah, annotations are super powerful. Token distance in-context and all that jazz.
The author mentions annotations but doesn't go into detail about how to feed the annotations to Claude.
ramoz•16m ago
While it hooks directly into plan mode, the annotate command works for any custom markdown approach - annotate in a nice visual, and automatically send the feedback to the agent