Opus 4.7 vs. 4.6 after 3 days of real coding side by side from my actual session

5•agentseal•9h ago

I spent some time today comparing Opus 4.6 and 4.7 using my own usage data to see how they actually behave side by side.

It's still pretty early for 4.7, but a few things surprised me.

In my sessions, 4.7 gets things right on the first try less often than 4.6. One-shot rate sits around 74.5% vs 83.8%, and I am seeing roughly double the retries per edit (0.46 vs 0.22).

It also produces a lot more output per call, about 800 tokens vs 372 on 4.6, which makes it noticeably more expensive. Cost per call is $0.185 vs $0.112, roughly 65% higher.

When I broke it down by task type, coding and debugging both looked weaker on 4.7. Coding one-shot dropped from 84.7% to 75.4%, debugging from 85.3% to 76.5%. Feature work was slightly better on 4.7 (75% vs 71.4%), but the sample is small. Delegation showed a big gap (100% vs 33.3%), though that one only has 3 samples on the 4.7 side so I wouldn't read much into it yet.
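To put the small-sample caveat in numbers, here is a minimal sketch of a 95% Wilson score interval for a success rate. The function name `wilson_interval` is hypothetical, and since the post does not say which delegation figure belongs to 4.7, both 1 of 3 and 3 of 3 are shown purely as assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Assumed counts: a 33.3% rate on 3 samples is 1 of 3, a 100% rate is 3 of 3.
for successes in (1, 3):
    low, high = wilson_interval(successes, 3)
    print(f"{successes}/3 -> {low:.2f} to {high:.2f}")
```

With only 3 samples, either interval spans a large part of the 0 to 1 range, which is why the delegation gap is not worth reading into yet.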

4.7 also uses fewer tools per turn (1.83 vs 2.77) and barely delegates to subagents (0.6% vs 3.1%). Not sure yet if that's a style difference or just the smaller sample.

A couple of caveats. This is about 3 days of 4.7 data (3,592 calls) vs 8 days of 4.6 (8,020 calls). Some categories only have a handful of examples. These numbers will shift with more usage, and your results will probably look different depending on what kind of work you do.

npx codeburn compare

Comments

alegd•8h ago

Interesting data. I use Claude Code daily and noticed 4.7 feels different but couldn't put numbers to it like this.

Does your one-shot rate account for how much context you give it? I keep a detailed CLAUDE.md with project conventions, and I'm wondering if that closes the gap at all or if 4.7 just struggles regardless.

The fewer tools per turn thing worries me. Are you seeing it hallucinate project structure more? In my sessions it seems to want to figure things out in its head instead of actually reading the files.

More expensive and lower first-try accuracy is rough. You planning to stick with 4.7 or going back?

alwillis•7h ago

Anthropic provides details on the differences between Opus 4.7 and 4.6, including that Opus 4.7 doesn't call tools as frequently as 4.6 due to being more capable. Depending on the task at hand, that could be a good thing or not so good [1].

For example, regarding instruction following:

Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6, particularly at lower effort levels. It will not silently generalize an instruction from one item to another, and it will not infer requests you didn't make.

[1]: https://platform.claude.com/docs/en/build-with-claude/prompt...

alegd•6h ago

That explains a lot actually. So the fewer tool calls are by design. Makes sense, but for coding specifically I'd rather it read my files than guess what's in them.

agentseal•3h ago

The one-shot rate doesn't factor in context size directly, it just tracks whether an edit succeeded without retries. That said, a detailed CLAUDE.md probably helps both models equally since the context is the same either way. Would be interesting to isolate that though.
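A minimal sketch of the metric as described here: an edit counts as one-shot if it succeeded with zero retries. The `Edit` record and its `retries` field are hypothetical stand-ins for illustration, not the tool's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    retries: int  # hypothetical field: retries needed before the edit succeeded


def one_shot_rate(edits: list[Edit]) -> float:
    """Fraction of edits that succeeded without any retry."""
    if not edits:
        return 0.0
    return sum(1 for e in edits if e.retries == 0) / len(edits)


def retries_per_edit(edits: list[Edit]) -> float:
    """Average number of retries across all edits."""
    if not edits:
        return 0.0
    return sum(e.retries for e in edits) / len(edits)


# Made-up sample data, only to show how the two numbers relate.
sample = [Edit(0), Edit(0), Edit(1), Edit(2)]
print(one_shot_rate(sample), retries_per_edit(sample))
```

Nothing in this calculation looks at prompt or context size, so a separate run with and without CLAUDE.md would be needed to isolate its effect.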

I have started to roll back to 4.6 for some important tasks, since I have been working with it for a long time, but I am still using 4.7 for some fresh tasks.

agentseal•3h ago

On the fewer tools per turn, yeah, I think that lines up with what the other reply mentioned about 4.7 being more "in its head." I have not specifically tracked hallucinated project structure, but the higher retry rate suggests it is getting things wrong more often when it skips the read step.
