But there are a few nuggets we figured are worth sharing, like Anchor Comments [1], which have really made a difference:
——
# CLAUDE.md
### Anchor comments
Add specially formatted comments throughout the codebase, where appropriate, for yourself as inline knowledge that can be easily `grep`ped for.
- Use `AIDEV-NOTE:`, `AIDEV-TODO:`, or `AIDEV-QUESTION:` as the prefix, as appropriate.
- *Important:* Before scanning files, always first try to grep for existing `AIDEV-…`.
- Update relevant anchors after finishing any task.
- Make sure to add relevant anchor comments whenever a file or piece of code:
* is too complex, or
* is very important, or
* could have a bug
——
[1]: https://diwank.space/field-notes-from-shipping-real-code-wit...
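For concreteness, here is a rough sketch of what one of these anchors looks like in a file (the function and note text are made up for illustration); the payoff is that `grep -rn "AIDEV-"` surfaces every anchor in one pass:

```python
# AIDEV-NOTE: perf-hot-path; payloads can exceed 10 MB, so we stream chunks
# instead of buffering the whole body in memory.
async def handle_upload(stream) -> int:
    received = 0
    async for chunk in stream:
        # AIDEV-TODO: add backpressure once the upload queue refactor lands
        received += len(chunk)
    return received
```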
1. Add instructions in CLAUDE.md to not touch tests.
2. Disallow the Edit tool for test directories in the project’s .claude/settings.json file (a sketch of what that could look like follows below).
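A minimal sketch of such a settings file; this assumes Claude Code's `permissions.deny` rule syntax for tools, so double-check the exact pattern format against the current docs and your own test paths:

```json
{
  "permissions": {
    "deny": [
      "Edit(tests/**)",
      "Write(tests/**)"
    ]
  }
}
```

Belt and suspenders: keep the CLAUDE.md instruction too, since the deny rule only blocks the tools, not the model's enthusiasm for working around failing tests some other way.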
I meant, though, in the wider context of the team: everyone uses it, but not everyone will work the same way or use the same underlying prompts as they work. So how do you ensure everyone keeps to that agreement?
There's nothing specific to using Claude or any other automation tool here. You still use code reviews, linters, etc. to catch anything that isn't following the team norms and expectations. Either that or, as the article points out, someone will cause an incident and may be looking for a new role (or nothing bad happens and no one is the wiser).
As a very experienced engineer who uses LLMs sporadically* and not in any systematic way, I really appreciated seeing how you use them in production in a real project. I don’t know why people are being negative; you just mentioned your project in detail where it was appropriate to talk about its structure. Doesn’t strike me as gratuitous self-promotion at all.
Your post is giving me motivation to empower the LLMs a little bit more in my workflows.
*: They absolutely don’t get the keys to my projects but I have had great success with having them complete specific tasks.
And regarding the HN post getting buried for a while there...[1] Somewhat ironic that an article about using AI to help write code would get canned for using an AI to help write it :D
You mentioned that the LLM should never touch tests, then followed up with an example refactoring of 500+ endpoints completed in 4 hours. This is impressive! I wonder if those 4 hours included test refactoring as well, or if that is just prompting time?
here's the original chat: https://chatgpt.com/share/6844eaae-07d0-8001-a7f7-e532d63bf8...
I also used bits from claude research but apparently if you use claude research, they don't let you create a share link -_-
I recently wrote a long Markdown document, and asked Claude, ChatGPT, Grok, and Gemini to improve it.
Comparing outputs, it was very close between Gemini and Claude, but I decided that Claude was slightly better-written.
One thing I would have liked to know is the difference between a workflow like this and the use of aider. If you have any perspective on that, it would be great.
Hi, AI skeptic with an open-mind here. How much will this cost me to try? I don't see that mentioned in your writeup.
I admit I have no idea what the real differences are. Everybody seems to claim to be the best and most comprehensive AI coding solution.
We’ve been asking the community to refrain from publicly accusing authors of posting LLM-generated articles and comments. But the other side of that is that we expect authors to post content that they’ve created themselves.
It’s one thing to use an LLM for proof-reading and editing suggestions, but quite another for “60%” of an article to be LLM-generated. For that reason I’m having to bury the post.
Edit: I changed this decision after further information and reflection. See this comment for further details: https://news.ycombinator.com/item?id=44215719
Edit: On reflection, given your explanation of your use of AI and given another comment [1] I replied to below, I don't think this post is disqualified after all.
Tag it, let users decide how they want to vote.
Aside: meta: If you're speaking on behalf of HN you should indicate that in the post (really with a marker outside of the comment).
Regarding your other suggestion: it's been the case ever since HN started 18 years ago that moderators/modcomments don't have any special designation. This is due to our preference for simple design and an aversion to seeming separate from the community. We trust that people will work it out and that has always worked well here.
If I invent the wheel, and have an LLM write 90% of the article from bullet points and edit it down, don't we still want HN discussing the wheel?
Not to say that the current generation of AI isn't often producing boring slop, but there's nothing that says it will remain that way, and percent of AI assistance seems like the wrong metric to chase, to me.
Slop is slop, whether a human or AI wrote it--I don't want to read it. Great is great. Period. If a human or AI writes something great, I want to read it.
Assuming AI writing will remain slop is a bold assumption, even if it holds true for the next 24 hours.
“I didn't have time to write a short letter, so I wrote a long one instead.”
- Mark Twain
Absolutely not. I would much rather take something that is boring and not thought-provoking but authentic and real than, as you say, AI slop.
If you want that sort of content maybe LinkedIn is a better place.
“I supplied the ideas” is literally the first thing anyone caught out using ChatGPT to do their homework says… I’d tend to believe someone’s first statement instead of the backpedal once they’ve been chastised for it.
I typically try to also include the original Claude chat’s link in the post but it seems like Claude doesn’t allow sharing chats with deep research used in them.
See this series of posts for example, I have included the link right at the beginning: https://diwank.space/juleps-vision-levels-of-intelligence-pt...
I completely get the critique and I already talked about it earlier: https://news.ycombinator.com/item?id=44213823
Update: here’s an older chatgpt conversation while preparing this: https://chatgpt.com/share/6844eaae-07d0-8001-a7f7-e532d63bf8...
- Is there a more elegant way to organize the prompts/specifications for LLMs in a codebase? I feel like CLAUDE.md, SPEC.mds, and AIDEV comments would get messy quickly.
- What is the definition of "vibe-coding" these days? I thought it refers to the original Karpathy quote, like cowboy mode, where you accept all diffs and hardly look at code. But now it seems that "vibe-coding" is catch-all clickbait for any LLM workflow. (Tbf, this title "shipping real code with Claude" is fine)
- Do you obfuscate any code before sending it to someone's LLM?
Yeah, the comments do start to pile up. I’m working on a vscode extension that automatically turns them into tiny visual indicators in the gutter instead.
> - What is the definition of "vibe-coding" these days? I thought it refers to the original Karpathy quote, like cowboy mode, where you accept all diffs and hardly look at code. But now it seems that "vibe-coding" is catch-all clickbait for any LLM workflow. (Tbf, this title "shipping real code with Claude" is fine)
Depends on who you ask ig. For me, it hasn’t been a panacea, and I’ve often run into issues (3.7 Sonnet and Codex have had ~60% success for me, but Opus 4 is actually v good)
> - Do you obfuscate any code before sending it to someone's LLM?
In this case, all of it was open source to begin with but good point to think about.
1. We ran into really bad minefields when we tried to come back to manually edit the generated tests later on. Claude tended to mock everything because it didn’t have context about how we run services, build environments, etc.
2. And this was the worst, all of the devs on the team including me got realllyy lazy with testing. Bugs in production significantly increased.
maybe we could either try this with opus 4 and hope that cheaper models catch up, or just drink the kool-aid and switch to pytest...
Devs almost universally hate 3 things:
1. writing tests;
2. writing docs;
3. manually updating dependencies;
and LLMs are a big boon when it comes to helping us avoid all 3, but forcing your team to keep writing tests is a sensible trade-off in this context, since, as you say, bugs in prod increased significantly.
But as a human, I do like the CLAUDE.md file. It's like documentation for dev reasoning and choices. I like that.
Is this faster than working in an old-style codebase with developers simply having the LLM chat open as they work? Seems like this ups the learning curve. The code here doesn't look very approachable.
> One of the most counterintuitive lessons in AI-assisted development is that being stingy with context to save tokens actually costs you more
Something similar I've been thinking about recently: for bigger projects and more complicated code, I really do notice a big difference between Claude Opus and Claude Sonnet. Sonnet sometimes just wastes so much time on ideas that never pan out, or make things worse. So I wonder: wouldn't it make more sense for Anthropic to not differentiate between Opus and Sonnet for people with a Max subscription? It seems like Sonnet takes 10-20 turns to do what Opus can do in 2 or 3, so in the end forcing people over to Sonnet would ultimately cost them more.
Simplest I’d say is: get the $100/mo Max and then `npm install -g @anthropic-ai/claude-code`
There's a good middle ground where the LLMs can help us solve problems faster optimizing for outcomes rather than falling in love with solving the problem. Many of us usually lose sight of the actual goal we're trying to achieve when we get distracted by implementation details.
Our style guide for humans is about 100 lines long, with lines like "Add a ! to the end of a function name if and only if it mutates one of its inputs". Our style guide for Claude is ~500 lines long, and equivalent sections have to include many examples like "do this, don't do this" to work.
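To make the "do this, don't do this" point concrete, here is a hypothetical excerpt of the kind of paired example such a section needs. It is adapted to Python (the `!` suffix from the human guide isn't legal there), so treat it as an illustration of the shape rather than a quote from the actual guide:

```python
# Rule: a function that mutates one of its inputs must say so in its name.

# GOOD: the name signals the in-place mutation, and the None return type
# makes it hard to mistake for a pure function.
def dedupe_inplace(items: list[str]) -> None:
    seen: set[str] = set()
    items[:] = [x for x in items if not (x in seen or seen.add(x))]

# BAD: reads like a pure function, but quietly rewrites the caller's list.
def dedupe(items: list[str]) -> list[str]:
    seen: set[str] = set()
    items[:] = [x for x in items if not (x in seen or seen.add(x))]
    return items
```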
> Never. Let. AI. Write. Your. Tests.
AI writes all of my tests now, but I review them all carefully. Especially for new code, you have to let the AI write tests if you want it to work autonomously. I explicitly instruct the AI to write tests and make sure they pass before stopping. I usually review these tests while the AI is implementing the code to make sure they make sense and cover important cases. I add more cases if they are inadequate.
I have been having a horrible experience with Sonnet 4 via Cursor and Web. It keeps cutting corners and misreporting what it did. These are not hallucinations. Threatening it with deletion (inspired by Anthropic's report) only makes things worse.
It also pathologically lies about non-programming things. I tried reporting it but the mobile app says "Something went wrong. Please try again later." Very bizarre.
Am I the only person experiencing these issues? Many here seem to adore Claude.
I find it's much better just to use Claude Web and be extremely specific about what I need it to do.
And even then half the code it generates for me is riddled with errors.
I remember version 3.5 doing okay on my simple tasks like text analysis, summaries, or little writing prompts. In the 4+ versions, the thing just can't follow instructions within a single context window for more than 3-4 replies.
When prompted about "why do you keep rambling if I asked you to stay concise" it says that its default settings are overriding its behavior and explicit user instructions, ditto for actively avoiding information that it considers "harmful". After pointing out inconsistencies and omissions in its replies it concedes that its behavior is unreliable and even extrapolates that it is made this way so users keep engaging with it for longer and more often.
Maybe it got too smart to its detriment, but if yes then it's really sad what Anthropic did to it.
Curious if the author (or others) tried other tools / models.
1. Claude Code with Opus 4
2. Cursor with Opus 4 or Gemini 2.5 Pro (Windsurf used to be an option but Anthropic has now cut them out)
3. (Coming up; still playing around) Claude Code’s GitHub Action
kasey_junk•8mo ago
You can use how uncomfortable you are with the AI doing something as a signal that you need to invest in systematic verification of that something. For instance, in the linked example, the team could build a system for verifying and validating their data migrations. That would move a whole class of changes into the AI realm.
This is usually much easier to quantify and explain externally than nebulous talk about tech debt in that system.
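As a sketch of what that verification could look like (table and column names here are hypothetical, and a real system would check far more than row counts):

```python
import sqlite3

def verify_migration(db_path: str, old_table: str, new_table: str) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems: list[str] = []
    con = sqlite3.connect(db_path)
    try:
        old_count = con.execute(f"SELECT COUNT(*) FROM {old_table}").fetchone()[0]
        new_count = con.execute(f"SELECT COUNT(*) FROM {new_table}").fetchone()[0]
        if old_count != new_count:
            problems.append(f"row count changed: {old_count} -> {new_count}")

        # Spot-check that every primary key made it across (assumes an `id` column).
        missing = con.execute(
            f"SELECT COUNT(*) FROM {old_table} o "
            f"WHERE NOT EXISTS (SELECT 1 FROM {new_table} n WHERE n.id = o.id)"
        ).fetchone()[0]
        if missing:
            problems.append(f"{missing} rows from {old_table} missing in {new_table}")
    finally:
        con.close()
    return problems
```

Once a check like that runs automatically against a staging copy, "did the AI break the migration?" stops being a judgment call.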
diwank•7mo ago
1. AIDEV-* is easier to visually distinguish, and
2. it is grep-able by the models, so they can "look around" the codebase in one glance