Agents that run while I sleep

https://www.claudecodecamp.com/p/i-m-building-agents-that-run-while-i-sleep

87•aray07•1h ago

Comments

BeetleB•1h ago

I wish there was a way to "freeze" the tests. I want to write the tests first (or have Claude do it with my review), and then I want to get Claude to change the code to get them to pass - but with confidence that it doesn't edit any of the test files!

aray07•1h ago

yeah i agree - this is somewhat the approach I have been using more of. Write the tests first based on specs and then write code to make the tests pass. This works well for cases where unit tests are sufficient.

simlevesque•1h ago

I use devcontainers in all the projects I use claude code on. [1] With it you can have claude running inside a container with just the project's code in write access and also mount a test folder with just read permissions, or do the opposite. You can even have both devcontainers and run them at the same time.

[1] https://code.claude.com/docs/en/devcontainer

If you want to try it just ask Claude to set it up for your project and review it after.

paxys•1h ago

Why can't you do just that? You can configure file path permissions in Claude or via an external tool.

kubb•1h ago

"Add a config option preventing you from modifying files matching src/*_test.py."

dboreham•1h ago

Just tell it that the tests can't be changed. Honestly I'd be surprised if it tried to anyway. I've never had it do that through many projects where tests were provided to drive development.

SatvikBeri•1h ago

You can remove edit permissions on the test directory

BeetleB•40m ago

I'm not up to speed on Claude's features. Can I, from the prompt, quickly remove those permissions and then re-add them (i.e. one command to drop, and one command to re-add)?

SatvikBeri•6m ago

Yeah, you can type `/permissions` and do it there. Or you can make a custom slash command, or just ask Claude to do it. You can also set it when you launch a claude session; there are a dozen ways to do anything.

comradesmith•1h ago

1. Make tests

2. Commit them

3. Proceed with implementation and tell agent to use the tests but not modify them

It will probably comply, and at least if it does change the tests you can always revert those files to where you committed them

tavavex•42m ago

Are there really no ways to control read/write permissions in a smart way? I've not had to do this yet, but is it really only capable of either being advisory with you implementing all the code, or it having full control over the repo where you just hope nothing important is changed?

You could probably make a system-level restriction so the software physically can't modify certain files, but I'm not sure how well that's going to fly if the program fails to edit it and there's no feedback of the failure.

mgrassotti•28m ago

You can use a Claude PreToolUse command hook to prevent write (or even read) access to specific files.

With this approach you can enforce that Claude cannot access specific files. It’s a guarantee and will always work, unlike a prompt or Claude.md, which is just a suggestion that can be forgotten or ignored.

This post has an example hook for blocking access to sensitive files:

https://aiorg.dev/blog/claude-code-hooks#:~:text=Protect%20s...
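A minimal sketch of the decision logic such a hook could use. It assumes (worth checking against the Claude Code hooks documentation) that the hook command receives a JSON payload on stdin with `tool_name` and `tool_input.file_path` fields, and that exiting with status 2 blocks the tool call. The tool names in `WRITE_TOOLS` and the "anything under `tests/` or ending in `_test.py`" rule are illustrative assumptions, not taken from the linked post.

```python
import json
from pathlib import PurePosixPath

# Assumed names of the tools that can modify files.
WRITE_TOOLS = {"Edit", "Write", "MultiEdit"}


def is_protected(file_path: str) -> bool:
    """Hypothetical rule: freeze anything under tests/ or named *_test.py."""
    path = PurePosixPath(file_path)
    return "tests" in path.parts or path.name.endswith("_test.py")


def decide(raw_payload: str) -> tuple[int, str]:
    """Return (exit_code, message) for one hook invocation.

    Exit code 2 is assumed to mean "block this tool call".
    """
    payload = json.loads(raw_payload)
    if payload.get("tool_name") not in WRITE_TOOLS:
        return 0, ""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if is_protected(file_path):
        return 2, f"Blocked: {file_path} is a frozen test file."
    return 0, ""
```

Used as a hook script, this would read `sys.stdin`, pass the text to `decide`, print the message to stderr, and exit with the returned code.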

BeetleB•42m ago

No. I don't want the mental burden of auditing whether it modified the tests.

vitro•27m ago

Then, run the agent vm-sandboxed, with tests mounted as a read-only directory, if your directory structure allows it.

jsw97•14m ago

Or, less securely, hash the tests and check the hash with a hook, post tool use. Or a commit hook.

pfortuny•1h ago

Why not use a client-server infrastructure for tests? The server sends the test code, the client runs the code, sends the output to the server and this replies pass/not pass.

One could even make zero-knowledge test development this way.

RealityVoid•1h ago

It's... really the same problem when you hire people to just write tests. A lot of the time it just confirms that the code does what the code does. Having clear specs of what the code should do makes things better and clearer.

aray07•1h ago

yup agree - i think you have specs and then do verifications against the spec. I have heard that this is how a lot of consulting firms work - you have acceptance criteria and that's how work is validated.

SoftTalker•1h ago

Yep, tests written after the fact are just verifying tautologies.

> Most teams don't [write tests first] because thinking through what the code should do before writing it takes time they don't have.

It's astonishing to me how much our industry repeats the same mistakes over and over. This doesn't seem like what other engineering disciplines do. Or is this just me not knowing what it looks like behind the curtain of those fields?

yurishimo•43m ago

When push comes to shove, software can usually be fudged. Unlike a building or a water treatment plant where the first fuck up could mean that people die.

I like to think that people writing actual mission critical software try their absolute best to get it right before shipping and that the rest of our industry exists in a totally separate world where a bug in the code is just actually not that big of a deal. Yeah, it might be expensive to fix, but usually it can be reverted or patched with only an inconvenience to the user and to the business.

It’s like the fines that multinational companies pay when breaking the law. If it’s a cost of doing business, it’s baked into the price of the product.

You see this also in other industries. OSHA violations on a residential construction site? I bet you can find a dozen if you really care to look. But 99% of the time, there are no consequences big enough for people to care so nobody wears their PPE because it “slows them down” or “makes them less nimble”. Sound familiar?

tibbar•58m ago

a lot of the value of tests is confirming that the system hasn't regressed beyond the behavior at the original release. It's bad if the original release is wrong, but a separate issue is if the system later accidentally stops behaving the way it did originally.

InsideOutSanta•14m ago

The issue I see is that the high test coverage created by having LLMs write tests results in almost all non-trivial changes breaking tests, even if they don't change behavior in ways that are visible from the outside. In one project I work on, we require 100% test coverage, so people just have LLMs write tons of tests, and now every change I make to the code base breaks tests.

So now people just ignore broken tests.

> Claude, please implement this feature.

> Claude, please fix the tests.

The only thing we've gained from this is that we can brag about test coverage.

Havoc•1h ago

They're definitely inferior to proper tests, but even weak CC tests on top of CC code is an improvement over no tests. If CC does make a change that shifts something dramatically even a weak test may flag enough to get CC to investigate.

Even better though - external test suites. Recently made an S3 server of which the LLM made quick work for MVP. Then I found a Ceph S3 test suite that I could run against it and oh boy. Ended up working really well as TDD though.

aray07•1h ago

yeah i have been hearing a lot more about this concept of “digital twins” - where you have high fidelity versions of external services to run tests against. You can take the API docs of these external services and give them to Claude. Wonder if that is where we will be going more towards.

didgeoridoo•57m ago

Isn’t this just an API sandbox? Many services have a test/sandbox mode. I do wish they were more common outside of fintech.

digitalPhonix•1h ago

> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”

aray07•1h ago

Yeah just wanted to see what the bottlenecks would be as I started pushing the limits. Eventually made this into a verification skill (github.com/opslane/verify)

lateforwork•59m ago

> When Claude writes tests for code Claude just wrote, it's checking its own work.

You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.

aray07•57m ago

yeah i have started using codex to do my code reviews and it helps to have “a different llm” - i think one of my challenges has been that unit tests are good but not always comprehensive. you still need functional tests to verify the spec itself.

dzuc•59m ago

red / green / refactor is a reasonable way through this problem

fragmede•57m ago

Adversarial AI code gen. Have another AI write the tests, tell Codex that Claude wrote some code and to audit the code and write some tests. Tell Gemini that Codex wrote the tests. Have it audit the tests. Tell Codex that Gemini thinks its code is bad and to do better. (Have Gemini write out why into dobetter.md)

tayo42•55m ago

I don't think this is right because it's talking about Claude like it's an entity in the world. Claude reviewing Claude generated code and framing it like an individual reviewing its own code isn't the same.

egeozcan•55m ago

You can always tell claude to use red-green-refactor and that really is a step-up from "yeah don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other to be more compliant.

aray07•51m ago

that's a great idea - i have been using codex to do my code reviews since i have found it to give better critique on code written by claude, but haven't tried it with testing yet!

darkbatman•22m ago

codex/gpt is a stubborn model, doubt it would accept claude reviews or counter it. have seen cases where claude is more willing to comply if you share feedback, though that's just sycophancy too.

codybontecou•48m ago

This sounds interesting. Can you go a bit deeper or provide references on how to implement the green/red/refactor subagent pattern?

dmd•39m ago

That's the cool bit - you don't have to. CC is perfectly well aware and competent to implement it; just tell it to.

irishcoffee•29m ago

"So this is how liberty dies... with thunderous applause.” - Padmé Amidala

s/liberty/knowledge

pastescreenshot•38m ago

What has worked better for me is splitting authority, not just prompts. One agent can touch app code, one can only write failing tests plus a short bug hypothesis, and one only reviews the diff and test output. Also make test files read only for the coding agent. That cuts out a surprising amount of self-grading behavior.

elemeno•37m ago

It’s not an agentic pattern, it’s an approach to test driven development.

You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).
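A tiny illustration of one red/green cycle, with a made-up `slugify` function as the feature. In the red step only the test class exists, so running it fails with a `NameError`; the green step adds just enough code to make it pass.

```python
import unittest


# Green step: the smallest implementation that satisfies the test below.
# Before this function existed, the test failed (red) with a NameError.
def slugify(title: str) -> str:
    """Lowercase the title and join its words with hyphens."""
    return "-".join(title.lower().split())


# Red step: this test was written first, against behavior that did not exist yet.
class SlugifyTest(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Agents That Run"), "agents-that-run")
```

The refactor step would then tidy `slugify` while keeping this test green.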

magicalist•37m ago

> But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start to grow in count and important new things aren't tested or aren't tested well.

I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests within that just asserted the test harness had been set up correctly, and that the functionality those tests used to cover was now only checked for existence, with its behavior virtually untested.

Reward hacking is very real and hard to guard against.

lagrange77•20m ago

> Reward hacking is very real and hard to guard against.

Is it really about rewards? I'm genuinely curious. Because it's not a RL model.

nurettin•12m ago

They probably meant goal hacking. (I just made that up)

egeozcan•10m ago

The trick is, with the setup I mentioned, you change the rewards.

The concept is:

Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.

Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.

Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.

It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.

skybrian•7m ago

How do you define visibility rules? Is that possible for subagents?

afro88•23m ago

Good idea, and an improvement, but you still have that fundamental issue: you don't really know what code has been written. You don't know the refactors are right, in alignment with existing patterns etc.

jdlshore•55m ago

Pet peeve: this post misunderstands “TDD.” What it really describes is acceptance tests.

TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.

TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”

Thank you for coming to my TED^H^H^H TDD talk.

wnevets•48m ago

> TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice.

I would like to emphasize that feedback includes being alerted to breaking something you previously had working in a seemingly unrelated/impossible way.

throwyawayyyy•51m ago

I am afraid that we are heading to a world in which we simply give up on the idea of correct code as an aspiration to strive for. Of course code has always been bad, and of course good code has never been a goal in the whole startup ecosystem (for perfectly legitimate reasons!). But that real production code, for services that millions or even billions of people rely on, should be reliable, that if it breaks that's a problem, this is the whole _engineering_ part of software engineering. And we can say: if we give that up we're going to have a whole lot more outages, security issues, all those things we are meant to minimize as a profession. And the answer is going to be: so what? We save money overall. And people will get used to software being unreliable; which is to say, people will not have a choice but to get used to it.

afro88•50m ago

I guess to reach this point you have already decided you don't care what the code looks like.

Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

aray07•46m ago

yeah honestly that's what i am struggling with too and I don't have a good solution. However, I do think we are going to see more of this - so it will be interesting to see how we are going to handle this.

i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)

kg•41m ago

It sounds like you know this but what happened is that you didn't do 4 weeks of work over 2 days, you got started on 4 weeks of work over 2 days, and now you have to finish all 4 weeks worth of work and that might take an indeterminate amount of time.

If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.

You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.

Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?

Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.
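The backpressure rule above can be as small as a gate the agent launcher consults before starting new work. A sketch, with a hypothetical `ReviewGate` class and the limit of 3 taken from the example figure in this comment:

```python
class ReviewGate:
    """Refuse to start new agent work while too many PRs await review."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.in_review: set[str] = set()

    def try_open(self, pr_id: str) -> bool:
        """Register a PR for review; return False if the queue is full."""
        if len(self.in_review) >= self.limit:
            return False
        self.in_review.add(pr_id)
        return True

    def close(self, pr_id: str) -> None:
        """Call when a PR is merged or abandoned, freeing a slot."""
        self.in_review.discard(pr_id)
```

A launcher would only kick off the next agent task when `try_open` returns `True`.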

akshaysg•40m ago

I've been thinking a lot about this!

Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).

IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.

logicchains•33m ago

>Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.

kwanbix•29m ago

So you have become a reviewer instead of a programmer? Is that so? Honest question. And if so, what is the advantage of looking at code for 12 hours instead of coding for 12?

zer00eyz•29m ago

> how do you review all the code?

Code review is a skill, as is reading code. You're going to quickly learn to master it.

> It's like 20k of line changes over 30-40 commits.

You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.

> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, in line documentation (cross referenced) and digestible chunks.

But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.

bhouston•49m ago

I call this "Test Theatre" and it is real. I wrote about it last year:

https://benhouston3d.com/blog/the-rise-of-test-theater

You have to actively work against it.

jakewins•22m ago

This was really good, and I second leaning on property testing. I’ve had really good outcomes from setting up Schemathesis and getting blanket coverage for stuff like “there should be no request you can generate as logged in user A that lets you do things as or see things that belong to user B”, as well as “there should be no request you can find to any API endpoint that can trigger a 5xx response”

OsrsNeedsf2P•39m ago

Our app is a desktop integration and last year we added a local API that could be hit to read and interact with the UI. This unlocked the same thing the author is talking about - the LLM can do real QA - but it's an example of how it can be done even in non-web environments.

Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.

seanmcdirmid•38m ago

I've been doing differential testing in Gemini CLI using sub-agents. The idea is:

1. one agent writes code from the spec

2. one agent writes tests from identified edge cases in the spec.

3. a QA agent runs the tests against the code. When a test fails, it examines the code and the test (the only agent that can see both) to determine blame, then gives feedback to the code and/or test writing agent on what it perceives the problem as.

(repeat 1 and/or 2 then 3 until all tests pass)

Since the code can never fix itself to directly pass the test and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer agents both have the same incorrect understanding of the spec (which is very improbable, like something that will happen before the heat death of the universe improbable, it is much more likely the spec isn't well grounded/ambiguous/contradictory or that the problem is too big for the LLM to handle and so the tests simply never wind up passing).
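The control flow of that loop might look roughly like this. The four callables stand in for the sub-agents and the test runner; their names and signatures are assumptions for the sketch, not Gemini CLI APIs.

```python
from typing import Callable, Optional


def differential_loop(
    write_code: Callable[[Optional[str]], str],
    write_tests: Callable[[Optional[str]], str],
    run_tests: Callable[[str, str], list],
    assign_blame: Callable[[str, str, list], dict],
    max_rounds: int = 5,
) -> bool:
    """Run tests up to max_rounds times; return True once they all pass."""
    code = write_code(None)    # step 1: code from the spec, no feedback yet
    tests = write_tests(None)  # step 2: tests from the spec, no feedback yet
    for _ in range(max_rounds):
        failures = run_tests(code, tests)  # step 3: QA runs tests against code
        if not failures:
            return True
        # Only the QA step sees both sides; writers just get its feedback.
        blame = assign_blame(code, tests, failures)
        if blame.get("code"):
            code = write_code(blame["code"])
        if blame.get("tests"):
            tests = write_tests(blame["tests"])
    return False  # the failure case described above: tests never pass
```

Capping the rounds turns "the tests simply never pass" into an explicit result the caller can act on.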

storus•32m ago

Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.

jaggederest•24m ago

Anyone who wants a more programmatic version of this, check out cucumber / gherkin - very old school regex-to-code plain english kind of system.
