This section really stood out to me. I knew that asking GPT-5 to think gets better results, but I didn't know Claude had the same behavior. I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude?
If you use the word "think" at any point, this mode will also trigger...which is sometimes inconvenient.
Plan mode is probably better for these situations though. Claude seems to have gotten much worse recently, but even before that, if it was stuck, asking it to "super think" never dislodged it (presumably because AI agents have no capacity for creativity; telling them to think more is just looping harder).
The plan -> do loop is significantly more effective, as most AI agents I have used extensively will get lost on trivial tasks if you only have the "do" phase without a clear plan (some, such as GPT, also appear unable to plan effectively... we have a contract with OpenAI at work, and I have no idea how people use it).
think / think hard / think harder / ultrathink
These are all valid Claude commands/tokens to enhance its "thinking" abilities: "think" < "think hard" < "think harder" < "ultrathink"
[1] https://www.anthropic.com/engineering/claude-code-best-pract...
Any such % would be meaningless because both are non-deterministic black boxes with literally undefined behavior. Any % you'd be seeing could just be differences in available compute as the US is waking up.
Solutions for running only the E2E tests affected by changed files have been around since long before LLMs. This is a bandage on poor CI.
> Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.
If you have some particular issue with the author's methodology, you should state that.
We don't know the methodology, since the author does not state how he verified that behavior, or how he would verify it for a large code base.
There's a handy list to check against the article here: https://dmitriid.com/everything-around-llms-is-still-magical... starting at "For every description of how LLMs work or don't work we know only some, but not all of the following"
- Do we know which projects people work on?
It's pretty easy to discover that OP works on https://livox.com.br/en/, a tool that uses AI to let people with disabilities speak. That sounds like a reasonable project to me.
- Do we know which codebases (greenfield, mature, proprietary etc.) people work on?
The e2e tests took 2 hours to run and the website quotes ~40M words. That is not greenfield.
- Do we know the level of expertise the people have?
It seems like they work on nontrivial production apps.
- How much additional work did they have reviewing, fixing, deploying, finishing etc.?
The article says very little.
This doesn't work in distributed systems, since changing the behavior of one file that's compiled into one binary can cause a downstream issue in a separate binary that sends a network call to the first. E.g., a programmer makes a behavioral change to binary #1 that stays within its defined behavior, but runs into Hyrum's Law because binary #2 relied on the old observable behavior.
- avoid distributed systems at all costs
- if you can't avoid them, never make breaking API changes
Running all E2E tests in a pipeline isn't feasible due to time constraints (takes hours). Most companies just run these tests nightly (and we still do). Which means we would still catch any issues that slip through the initial screening. But so far, nothing did.
Risks: missing e2e tests that should have run letting bugs into production, more time spent chasing down flakes due to non determinism.
Benefits: increased productivity, catch bugs sooner (since you can run e2e tests more often).
- Smoke tests (does this page load at all)
- More narrow tests (does this particular flow work)
I think you're referring to smoke tests, but you likely always want to run smoke tests. It's the narrow tests that you're safe removing.
At Meta we use similar heuristics to decide which tests to run per PR. When the system gets it wrong, which is often, it's painful and leads to merged code that broke tests that never ran.
Like anything, it's about tradeoffs. Though if it were me, I'd simply write a mechanism that deterministically decides which areas of code pertain to which tests, and use that simple program to determine which tests run. The algorithm could loosely be as simple as code owners and git-blame, but keyed against some big Code->Test mapping that you can have Claude/etc. build ahead of time. The difference being it's deterministic between PRs and can be audited by humans for anything obviously poor or w/e.
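A minimal sketch of that kind of deterministic selector, assuming a hand-reviewed JSON mapping from source-path globs to E2E tests (the file name and patterns here are made up for illustration):

    # select_e2e.py - pick E2E tests from a static, auditable code->test mapping.
    # The mapping file lives in the repo and could be generated once (e.g. by Claude)
    # and then reviewed/edited by humans; both names below are illustrative.
    import fnmatch
    import json
    import subprocess

    def changed_files(base="origin/main"):
        out = subprocess.run(["git", "diff", "--name-only", base],
                             capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines() if line]

    def select_tests(changed, mapping):
        selected = set()
        for path in changed:
            for pattern, tests in mapping.items():  # e.g. "src/checkout/*": ["e2e/checkout.spec.ts"]
                if fnmatch.fnmatch(path, pattern):
                    selected.update(tests)
        return selected

    if __name__ == "__main__":
        mapping = json.load(open("e2e-mapping.json"))
        print("\n".join(sorted(select_tests(changed_files(), mapping))))

Because the mapping is just a file in the repo, changes to it go through code review like anything else.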
As much as LLMs are interesting to me, I hate using them in places where I want consistency. CI seems terrible for them... to me at least.
A good middle ground could be to allow the diff to land once the “AI quick check” passed, then keep the full test suite running in the background. If they run them side by side for a while and see that the AI quick check caught the failing test every time, I’d be convinced.
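A sketch of what that side-by-side comparison could look like, assuming CI writes out both the quick check's selection and the background run's failures as JSON (both file names are hypothetical):

    # shadow_check.py - compare the AI quick check's selection against the
    # background full-suite run; any failure it didn't select is a miss.
    import json

    def missed_failures(quick_check_selection, full_run_failures):
        return set(full_run_failures) - set(quick_check_selection)

    if __name__ == "__main__":
        selected = json.load(open("ai_selected_tests.json"))   # written by the PR job
        failed = json.load(open("full_run_failures.json"))     # written by the background run
        missed = missed_failures(selected, failed)
        if missed:
            print("AI quick check would have missed:", ", ".join(sorted(missed)))
        else:
            print("AI quick check covered every failure in this run.")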
"We" and "win" are both doing a lot of heavy-lifting here, as they are whenever talks about LLM labor destruction.
So this is in theory a good thing: an LLM replacing a tedious task that no one was going to be hired to do anyway.
And besides, labor destruction could be a truly wonderful thing, if we had a functional, empathetic society in which we a) ensure that people have paths to retrain for new jobs while their basic needs are met by the state (or by the companies doing the labor destruction), and/or b) allow people whose jobs are destroyed to just not work at all, but provide them with enough resources to still have a decent life (UBI plus universal health care basically).
My utopia is one where there is no scarcity, and no one has to work to survive and have a good life. If I could snap my fingers and eliminate the need for every single job, and replace it with post-scarcity abundance, I would. People would build things and do science and help others because it gives them pleasure to do so, not because they have to in order to survive. And for people who just want to live lives of leisure, that would be fine, too. I don't think humanity will ever get to this state, mind you, but I can dream of a better world.
> what if we could run only the relevant E2E tests
The real title should be "Using Claude Code to Reduce E2E Tests by 84%."
Article flags should be reserved for things you don't believe should be on HN at all.
I do wonder if this is as feasible at scale, where breaking master can be extremely costly (although at least it’s not running all tests for all commits, so a broken test won’t break all CI runs). Maybe it could be paired with, say, running all E2E tests post-merge and reporting breakages ASAP.
DHH recently described[0] the approach they've taken at Basecamp, reducing ~180 comprehensive-yet-brittle system tests down to 10 good-enough smoke tests, and it feels much more in the spirit of where I would recommend folks invest effort: teams have way more tests than they need for an adequate level of confidence. Code and tests are a liability, and, to paraphrase Kent Beck[1], we should strive to write the minimal amount of tests and code to gain the maximal amount of confidence.
The other wrinkle here is that we're often paying through the nose in costs (complexity, actual dollars spent on CI services) by choosing to run all the tests all the time. It's a noble and worthy goal to figure out how not to do that, _but_, I think the conclusion shouldn't be to throw more $$$ into that money-pit, but rather just use all the power we have in our local dev workstations + trust to verify something is in a shippable state, another idea DHH covers[2] in the Rails World 2025 keynote; the whole thing is worth watching IMO.
[0] - https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752
[1] - https://stackoverflow.com/questions/153234/how-deep-are-your...
[2] - https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977
Teams need to periodically audit their tests, figure out what covers what, figure out what coverage is actually useful, and prune stuff that is duplicative and/or not useful.
OP says that ultimately their costs went down: even though using Claude to make these determinations is not cheap, they're saving more than they're paying Claude by running fewer tests (they run tests on a mobile device test farm, and I expect that can get pricey). But ultimately they might be able to save even more money by ditching Claude and deleting tests, or modifying tests to reduce their scope and runtime.
And at this point in the sophistication level of LLMs, I would feel safer about not having an LLM deciding which tests actually need to run to ensure a PR is safe to merge. I know OP says that so far they believe it's doing the right thing, but a) they mention their methodology for verifying this in a comment here[0], and I don't agree that it's a sound methodology[1], and b) LLMs are not deterministic and repeatable, so it could choose two very different sets of tests if run twice against the exact same PR. The risk of that happening may be acceptable, though; that's for each individual to decide.
So, it builds a dependency graph?
I've been playing with graph-related things lately, and it seems like there might be more efficient ways to do this than asking a daffy robot to do the job of a specific (AI-crafted?) tool.
One could even get all fancy with said tool and use it to do fun and exciting things with the cross-file dependencies, like tracking down unused includes (or whatever) to improve build times.
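For a Python codebase, a first cut at such a tool might be nothing more than the ast module plus a reversed import graph; purely a sketch, with stem-level resolution only and illustrative names:

    # impact_graph.py - file-level import graph; find test modules affected by a change.
    import ast
    import pathlib
    from collections import defaultdict

    def import_graph(root):
        """Map module stem -> set of top-level modules it imports (stem-level only)."""
        graph = defaultdict(set)
        for path in pathlib.Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text(), filename=str(path))
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    graph[path.stem].update(a.name.split(".")[0] for a in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    graph[path.stem].add(node.module.split(".")[0])
        return graph

    def affected_tests(changed_module, graph):
        """Walk the reversed graph to find everything that transitively imports the change."""
        reverse = defaultdict(set)
        for mod, deps in graph.items():
            for dep in deps:
                reverse[dep].add(mod)
        seen, stack = set(), [changed_module]
        while stack:
            for dependent in reverse[stack.pop()]:
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        return {m for m in seen if m.startswith("test_")}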
1. Are you seeing a lower number of production hot fixes?
2. Have you tried other models? Or thought about using the output of different models and combining the results when they differ?
3. Other than time & cost, what software benchmarks are considered (e.g., fewer hot fixes)?
This is really cool, btw.
> Are you seeing a lower number of production hot fixes?
Yes, with E2E tests in general: They are more effective at stopping incidents than other tests, but they require more effort to write. In my estimation, we prevent about 2-3 critical bugs per month from being merged into main (and consequently deployed).
For this project specifically: I think the critical bugs would have been caught in our overnight full E2E run anyway. The biggest gain was that E2E tests took too much time in the pipeline, and finding the root cause of bugs in nightly tests took even more time. When a test fails in the PR, we can quickly fix it before merging.
> Have you tried other models? Or thought about using output from different models and combining the results when they differ?
Not yet, but I think we need to start experimenting. Claude went offline for 30 minutes over the last 2 days, and engineers were blocked from merging because of it. I'm planning to add claude-code-router as a fallback.
Outside of a fun project to tinker on and see the results, I wouldn’t use this for anything that runs in production. It is better to use an LLM to help you build better static analysis tools than to use the technique proposed in the article.
I was waiting to see this part demonstrated and validated: for a given PR, whether you created an expected set of tests that should run and then compared it to what actually ran. Without a baseline, as any tester would tell you, the LLM output is being trusted without checking.
The only way to verify that is to do what GP suggested: for each PR, manually make a list of the tests that you believe should be run, and then compare that to the list of tests that the LLM actually runs. Obviously you aren't going to do this forever, but you should either do it for every single PR for some number of PRs, or pick a representative sample over time instead.
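Scoring that sample could be as simple as the sketch below, with the human-curated list as ground truth (test names are hypothetical); "never missed a relevant test" amounts to claiming recall stays at 1.0:

    # score_selection.py - score the LLM's selection for one PR against a human baseline.
    def score(expected, selected):
        true_pos = len(expected & selected)
        recall = true_pos / len(expected) if expected else 1.0      # did it miss anything?
        precision = true_pos / len(selected) if selected else 1.0   # how much extra did it run?
        return {"recall": recall, "precision": precision}

    # Example: the reviewer expected 4 tests; Claude picked 6, 4 of them relevant.
    expected = {"checkout", "payment", "cart", "login"}
    selected = {"checkout", "payment", "cart", "login", "profile", "search"}
    print(score(expected, selected))   # {'recall': 1.0, 'precision': 0.666...}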
brynary•2h ago
However, those methods are tightly bound to programming languages, frameworks, and interpreters, so they are difficult to support across technology stacks.
This approach substitutes the intelligence of the LLM, making educated guesses about which tests to execute, to achieve the same goal: run all of the tests that could fail and none of the rest (balancing a precision/recall tradeoff). What’s especially interesting about this to me is that the same technique could be applied to any language or stack with minimal modification.
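A rough sketch of that stack-agnostic selection, assuming the Anthropic Python SDK; the model id, prompt, and test names are illustrative, and a real version would need to validate the model's output:

    # llm_select.py - ask an LLM which E2E tests a diff could affect (works for any stack).
    import json
    import subprocess
    import anthropic  # assumes the SDK is installed and ANTHROPIC_API_KEY is set

    def select_tests(diff, all_tests):
        client = anthropic.Anthropic()
        message = client.messages.create(
            model="claude-sonnet-4-20250514",  # illustrative model id
            max_tokens=1024,
            messages=[{"role": "user", "content": (
                "Return a JSON array containing only the E2E tests that could plausibly "
                f"be affected by this diff.\n\nTests:\n{json.dumps(all_tests)}\n\nDiff:\n{diff}"
            )}],
        )
        return json.loads(message.content[0].text)  # a real version should validate this

    diff = subprocess.run(["git", "diff", "origin/main"],
                          capture_output=True, text=True, check=True).stdout
    print(select_tests(diff, ["e2e/checkout.spec.ts", "e2e/login.spec.ts"]))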
Has anyone seen LLMs in other contexts being substituted for traditional analysis to achieve language agnostic results?
wer232essf•2h ago
One interesting side effect I’ve noticed when experimenting with LLM-driven heuristics is that, while they lack the determinism of static/runtime analysis, they can sometimes capture semantic relationships that are invisible to traditional tooling. For example, a change in a configuration file or documentation might not show up in a dependency graph, but an LLM can still reason that it’s likely to impact certain classes of tests. That fuzziness can introduce false positives, but it also opens the door to catching categories of risk that would normally be missed.
I think the broader question is how comfortable teams are with probabilistic guarantees in their workflows. For some, the precision/recall tradeoff is acceptable if it means faster feedback and reduced CI bills. For others, the lack of hard guarantees makes it a non-starter. But I do see a pattern emerging where LLM-based analysis isn’t necessarily a replacement for traditional methods, but a complementary layer that can generalize across stacks and fill in gaps where traditional tools don’t reach.