Claude Opus 4.1

https://www.anthropic.com/news/claude-opus-4-1

636•meetpateltech•8h ago

Comments

jasonlernerman•8h ago

Has anyone tested it yet? How's it acting?

smerrill25•8h ago

waiting for this, too.

usaar333•8h ago

No obvious gains I feel from quick chats, but too early to tell.

These benchmark gains aren't that high, so I doubt it is that obvious.

jedisct1•7h ago

Tested it on a refactor of Zig code. It worked fine, but was very slow.

minimaxir•8h ago

This likely won't move the needle for Opus use over Sonnet while the cost remains the same. Using OpenRouter rankings (https://openrouter.ai/rankings) as a proxy, Sonnet 3.7 and Sonnet 4 combined generates 17x more tokens than Opus 4.

qsort•8h ago

All three major labs released something within hours of each other. This anime arc is insane.

x187463•8h ago

Given the GPT5 rumors, August is just getting started.

kridsdale3•6h ago

Given the Gregorian Calendar and the planet's path through its orbit, August is just getting started.

tomrod•5h ago

This legitimately made me chuckle.

wunderg•1h ago

Good one, made my day

ozgung•8h ago

What a time to be alive

tonyhart7•7h ago

as if they wait competitor first then launch it at the same time to make market decide which one is best

torginus•5h ago

I think this means that GPT5 is better - you can't launch a worse model after the competitor supersedes you - you have to show that you're in the lead even if its just for a day.

rapind•3h ago

Not sure that this is true. Are there a lot of people waiting anxiously to adopt the next model on the day of release and expecting some huge work advantage?

azan_•3h ago

Absolutely.

dnh44•1h ago

If you’re using an LLM near the limits of what it can do then a small improvement in performance is noticeable.

vFunct•7h ago

None of them seem to have published any papers associated with them on how these new models advanced the state-of-the-art though. =^(

hugodan•3h ago

china will do that for them

candiddevmike•7h ago

It's definitely a coincidence

wilg•7h ago

It's not a coincidence or a cartel, it's PR counterprogramming.

BudaDude•2h ago

Agree 100%

If you look at the past, whenever Google announces something major, OpenAI almost always releases something as well.

People forget realize that OpenAI was started to compete with Google on AI.

Etheryte•6h ago

This is why you have PR departments. Being on top of the HN front page, news sites, etc matters a lot. Even if you can't be the first, it's important to dilute the attention as much as possible to reduce the limelight your competitors get.

paulryanrogers•1h ago

"Prep the next three point releases now, but don't release any until I say so. None needs to be noticably better or even different, just has to have a higher number." -CEO of AI companies

andai•25m ago

How do they know when it's time? Corporate espionage? Or do they just have Next Thing queued up months in advance and ready to go.

j45•4m ago

They likely sit on releases ready to go.

qoez•4m ago

There's so many leakers in every lab

steveklabnik•8h ago

This is the bit I'm most interested in:

> We plan to release substantially larger improvements to our models in the coming weeks.

machiaweliczny•8h ago

This is so people don't immediately migrate for GPT5

NitpickLawyer•8h ago

Cheekily announcing during oAI's oss model launch :D

haaz•8h ago

it is barely an improvement according to their own benchmarks. not saying thats a bad thing, but not enough for anybody to notice any difference

waynenilsen•8h ago

i think its probably mostly vibes but that still counts, this is not in the charts

> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

ttoinou•8h ago

That's why they named it 4.1 and not 4.5

zamadatix•7h ago

When it's "that's why they incremented the version by a tenth instead of a half" you know things have really started to slow for the large models.

phonon•7h ago

Opus 4 came out 10 weeks ago. So this is basically one new training run improvement.

zamadatix•7h ago

And in 52 weeks we've gone 3.5->4.1 with this training improvement, meanwhile the 52 weeks prior to that were Claude -> Claude 3. The absolute jumps per version delta also used to be larger.

I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.

globalise83•5h ago

Is it really a bigger jump to go from plausible to frequently useful, than from frequently useful to indispensable?

zamadatix•3h ago

Why is there supposed to be no step between frequently useful and indispensable? Quickly going from nothing to frequently useful (which involved many rapid hops between) was certainly surprising, and that's precisely the lost momentum.

mclau157•6h ago

They released this because competitors are releasing things

leetharris•8h ago

Good! I'm glad they are just giving us small updates. Opus 4 just came out, if you have small improvements, why not just release them? There's no downside for us.

AstroBen•8h ago

I don't think this could even be called an improvement? It's small enough that it could just be random chance

j_bum•7h ago

I’ve always wondered about this actually. My assumption is that they always “pick the best” result from these tests.

Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.

gloosx•6h ago

They need to leave some room to release 10 more models. They could crank benchmarks to 100% but then no new model is needed lol? Pretty sure these pretty benchmark graphs are all completely staged marketing numbers since they do solve the same problems they are being trained on – no novel or unknown problematic is presented to them.

levocardia•5h ago

"You pay $20/mo for X, and now I'm giving you 1.05*X for the same price." Outrageous!

onlyrealcuzzo•5h ago

I will only add that it's interesting that in the results graphic, they simply highlighted Opus 4.1 - choosing not to display which models have the best scores - as Opus 4.1 only scored the best on about half of the benchmarks - and was worse than Opus 4.0 on at least one measure.

Topfi•4h ago

I am still very early, but output quality wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding Needles-in-the-Haystack (which Frontier models have done well on lately), Opus 4.1 seems to do better in following those needles even if not explicitly guided to compared to Opus 4.

jzig•8h ago

I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certain things while using Sonnet for others?

datameta•8h ago

I now eagerly await Sonnet 4.1, only because of this release.

rtfeldman•8h ago

Yes, Opus is very noticeably better at programming in both Rust and Zig in my experience. I wish it were cheaper!

MostlyStable•8h ago

Opus seems better to me on long tasks that require iterative problem solving and keeping track of the context of what we have already tried. I usually switch to it for any kind of complicated troubleshooting etc.

I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.

unshavedyak•8h ago

Same. I'm on the $200 plan and I find Opus "better", but Sonnet is more straight forward. Sonnet is, to me, a "don't let it think" model. It does great if you give it concrete and small goals. Anything vague or broad and it starts thinking and it's a problem.

Opus gives you a bit more rope to hang yourself with imo. Yes, it "thinks" slightly better, but still not good enough to me. But it can be good enough to convince you that it can do the job.. so i dunno, i almost dislike it in this regard. I find Sonnet just easier to predict in this regard.

Could i use Opus like i do Sonnet? Yes definitely, and generally i do. But then i don't really see much difference since i'm hand-holding so much.

adastra22•8h ago

Every time that Sonnet is acting like it has brain damage (which is once or twice a day), I switch to Opus and it seems to sort things out pretty fast. This is unscientific anicdata though, and it could just be that switching models (any model) would have worked.

anonzzzies•8h ago

Exactly that.

j45•8h ago

They both seem to behave differently depending on how loaded the system seems to be.

api•8h ago

I have suspected for a long time that hosted models load shed by diverting some requests to lesser models or running more quantized versions under high load.

parineum•7h ago

I think OpenRouter saves tokens by summarizing queries through another model, IIRC.

monatron•8h ago

This is a great use case for sub-agents IMO. By default, sub-agents use sonnet. You can have opus orchestrate the various agents and get (close to) the best of both worlds.

adastra22•6h ago

Is there a way to get persistent sub-agents? I'd love to have a bunch of YAML files in my repository, one for each sub-agent, and have those automatically used across all Claude Code instances I have on multiple machines (I dev on laptop and desktop), or across the team.

mwigdahl•5h ago

Yep: https://docs.anthropic.com/en/docs/claude-code/sub-agents

rapind•3h ago

In this case I don't think the controller needs to be the smartest model. I use sonnet as the main driver and pass the heavy thinking (via zen mcp) onto Gemini pro for example, but I could use openai or opus or all of them via OpenRouter.

Subagents seem pretty similar to using zen mcp w/ OpenRouter but maybe better or at least more turnkey? I'll be checking them out.

mark_undoio•2h ago

Amp (ampcode.com) uses Sonnet as its main model and has GPT o3 as a special purpose tool / subagent. It can call into that when it needs particularly advanced reasoning.

Interestingly I found that prompting it to ask the o3 submodel (which they call The Oracle) to check Sonnet's working on a debugging solution was helpful. Extra interesting to me was the fact that Sonnet appeared to do a better job once I'd prompted that (like chain of thought prompting, perhaps asking it to put forward an explanation to be checked actually triggered more effective thinking).

riwsky•16m ago

Great, now even computers need to leave the IC track if they want continued career progression.

gpm•7h ago

This seems like a case of reversion to the mean. When one model is performing below average, changing anything (like switching to another model) is likely to improve it by random chance...

keeeba•4h ago

Anthropic say Opus is better, benchmarks & evals say Opus is better, Opus has more parameters and parameters determine how much a NN can learn.

Maybe Opus just is better

HarHarVeryFunny•7h ago

Maybe context rot? If model's output seems to be getting worse or in a rut, then try just clearing context / starting a new session.

adastra22•6h ago

Switching models with the same context, in this case.

dested•8h ago

If I'm using cursor then sonnet is better, but in claude code Opus 4 is at least 3x better than Sonnet. As with most things these days, I think a lot of it comes down to prompting.

jzig•8h ago

This is interesting. I do use Cursor with almost exclusively Sonnet and thinking mode turned on. I wonder if what Cursor does under the hood (like their indexing) somehow empowers Sonnet more. I do not have much experience with using Claude Code.

seunosewa•8h ago

It's ridiculously overpriced in the API. Just like o3 used to be.

brenoRibeiro706•8h ago

I feel the same way. I usually use Opus to help with coding and documentation, and I use Sonnet for emails and so on.

biinjo•7h ago

Im on the Max plan and generally Opus seems to do better work than Sonnet. However, that’s only when they allow me to use Opus. The usage limits, even on the max plan, are a joke. Yesterday I hit the limits within MINUTES of starting my work day.

epolanski•7h ago

Yeah, you need to actively cherry pick which model to use in order to not waste tokens on stuff that would be easily handed by a simpler model.

furyofantares•7h ago

I'm a bit confused by people hitting usage limits so quickly.

I use Opus exclusively and don't hit limits. ccusage reports I'm using the API-equivalent of $2000/mo

rirze•7h ago

You always have to ask which plan they're paying for. Sometimes people complain about the $20 per month plan...

stavros•7h ago

There's no Opus quota on that plan at all.

furyofantares•6h ago

In this case I'm replying to someone who lead with "I'm on the Max plan" but I realize now that's ambiguous, maybe they are on 5x while I'm on 20x.

Bolwin•6h ago

That's insane. Are you accounting for caching? If not, there's no way this is going to last

furyofantares•6h ago

I'm using ccusage to get the number, I think it just looks at your history and calculates based on tokens vs API pricing. So I think it wouldn't account for caching.

But I totally agree there's no way it lasts. I'm mostly only using this for side projects and I'm sitting there interacting with it, not YOLO'ing, I do sometimes have two sessions going at the same time but I'm not firing off swarms or anything crazy. Just have it set to Opus and I chat with it.

Aeolun•1h ago

Claude Code definitely reports cached tokens, and I think CCusage does too, so it wouldn’t make sense for the calculation to be based on full pricing when they have the cached values.

dsrtslnd23•5h ago

same here constantly hit the Opus limits after minutes on Max plan

Aeolun•1h ago

Is this on x5? Because ever since they booted all the freeloaders I’ve not once seen the “you are approaching usage limits” message. Anyway, the “you are approaching usage limits” message shows up when you are over 50% of your tokens for that timeframe, so it’s not sure useful.

gpm•7h ago

I notice that on the "Agentic Coding" benchmark cited in the article Sonnet 4 outperformed Opus 4 (by 0.2%), and under performs Opus 4.1 (by -1.8%).

So this release might change that consensus? If you believe the benchmarks are reflective of reality anyways.

jimbo808•7h ago

> If you believe the benchmarks are reflective of reality anyways.

That's a big "if." But yeah, I can't tell a difference subjectively between Opus and Sonnet, other than maybe a sort of placebo effect. I'm more careful to write quality prompts when using Opus, because I don't want to waste the 5x more expensive tokens.

Uehreka•7h ago

> yet the general consensus and my own experience seem to be that Sonnet is much much better

Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.

I was trying to dig into this yesterday, but every time I come across a new thread the things people are saying and the proportions saying what are different.

I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.

taormina•7h ago

Just more ancedata, but I entirely agree. I can't say that I am happy with Sonnet's output at any point, really, but it still occasionally works, whereas Opus has been a dumpster fire every single time.

SkyPuncher•7h ago

I don't doubt Opus is technically superior, but it's not practically superior for me.

It's still pretty much impossible to have any LLM one-shot a complex implementation. There's just too many details to figure out and too much to explain for it to get correct. Often, there's uncertainty and ambiguity that I only understand the correct answer (or rather less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.

For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.

ssk42•5h ago

You can also always have it create design docs and mermaid diagrams for each task. Outline the why much easier earlier, shifting left

bdamm•4h ago

I've been having a great time with Windsurf's "Planning" feature. Have a nice discussion with Cascade (Claude) all about what it is that neerds to happen - sometimes a very long conversation including test code. Then when everything is very clear, make it happen. Then test and debug the results with all that context. Pretty nice.

jstummbillig•3h ago

Can you explain what you do exactly? Do you enable plan mode and use with chat...?

trenchpilgrim•1h ago

In Zed I switch the AI panel to ask mode and chat with the agent about different approaches and have it draft patches. Then when I think there's a design worth trying, switch to Write mode and have it implement that change + run the tests and diagnostics to verify the code at least compiles, tests pass and follows our style guides. Finally a line by line review + review of the test coverage (in terms of interface surface area) before submitting a PR for another human review.

Larrikin•1h ago

After watching a few videos trying to understand how people were using LLMs and getting useful results I found that even making a simpler version of the fancy planning mode in the LLM IDEs via the instructions.md produced hugely better productivity gains.

I started adding an instruction file along the lines of "Always tell me your plan to solve the issue first with short example code, never edit files without explicit confirmation of your plan" at the start and it is like a day and night difference in how useful it becomes. It also starts to feel like programming again where you can read through various files and instead of thinking in your head, you write out your thoughts. You end up getting confirmation or push back on errors that you can clean up.

Reading through a sort of wrong sort of right implementation spread across various files after every prompt just really sucked.

I'm not one shotting massive amounts of files, but I am enjoying the lack of grunt work.

jm4•7h ago

I use both. Sonnet is faster and more cost efficient. It's great for coding. Where Opus is noticeably better is in analysis. It surpasses Sonnet for debugging, finding patterns in data, creativity and analysis in general. It doesn't make a lot of sense to use Opus exclusively unless you're on a max20 plan and not hitting limits. Using Opus for design and troubleshooting and Sonnet for everything else is a good way to go.

astrostl•7h ago

With aggressive Claude Code use I didn't find Sonnet better than Opus but I did find it faster while consuming far fewer tokens. Once I switched to the $100 Max plan and configured CC to exclusively use Sonnet I haven't run into a plan token limit even once. When I saw this announcement my first thing was to CMD-F and see when Sonnet 4.1 was coming out, because I don't really care about Opus outside of interactive deep research usage.

ssss11•3h ago

That’s very strange. Sonnet is hot garbage and Opus is a miracle, for me. I also don’t see anyone praising sonnet anywhere.

sky2224•2h ago

I've found with limited context provided in your prompt, opus is just awful compared to even gpt-4.1, but once I give it even just a little bit more of an explanation, it jumps leagues ahead.

sothatsit•2h ago

Opus really shines for completing long-running tasks with no supervision. But if you are using Claude Code interactively and actively steering it yourself, Sonnet is good enough and is faster.

I don't believe anyone saying Sonnet yields better results than Opus though, as my experience has been exactly the opposite. But trade-off wise, I can definitely see it being a better experience when used interactively because of its speed and lower cost.

Aeolun•1h ago

My opinion of Opus is that it takes the correct action 19/20 times, where Sonnet takes the correct action 18/20 times. It’s not strictly necessary to use Opus, but if you have the subscription already it’s just a pure win.

paxys•8h ago

Why is everything releasing today?

datameta•8h ago

Could it be nobody wanted to be first and overshadowed, nor the only one left out - and it cascaded after the first announcement? My first hunch, though, was that it had been agreed upon. Game theory I think tells us that releasing same day in the pattern ABC BCA CAB etc would be lowest risk and highest average gain?

highfrequency•6h ago

If they release before GPT-5, they don't have to compare to GPT-5 in their benchmarks. It's a big PR win to be able to plausibly claim that your model is the best coding model at the time of release.

gusmally•8h ago

They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon

(He had been stuck in the Team Rocket hideout (I believe) for weeks)

alrocar•7h ago

just ran the LLM to SQL benchmark over opus-4.1 and it didn't top previous version :thinking: => https://llm-benchmark.tinybird.live/

epolanski•7h ago

How does running it multiple times performs?

LLMs are non-deterministic, I think benchmarks should be more about averages of N runs, rather than single shot experiments.

jedisct1•7h ago

Is it just me or is it super slow?

taormina•7h ago

Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. Wish they let you tell somehow.

At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day.

I've basically wasted the morning on Claude Code when I should've just been doing it all myself.

AlecSchueler•4h ago

I've also noticed Sonnet starting to degrade. It's developing some of the behaviours that put me off the competition in the first place. Needless explanations, filler in responses, wanting to put everything in lists, even increased sycophancy.

bavell•4h ago

> I've basically wasted the morning on Claude Code when I should've just been doing it all myself.

Welcome to the machine

https://www.youtube.com/watch?v=tBvAxSx0nAM&t=45s

Aeolun•1h ago

I feel like this is just related to my projects getting bigger. Claude Code is trying to keep up with my project evolving from 2k lines of code to 100k lines. Of course it’s going to feel worse.

UncleEntity•48m ago

Other than it starting out trying to produce a full and complete web app (or whatever) for my daily yak shaving session instead of the normal "let's talk about and work through this thing" the new Opus 4.1 seems to 'get it' a lot quicker than the old daffy robot did. It asked pertinent questions to understand the system we are working on and accomplished the goal of updating the design document so I don't have to keep explaining details at the start of every chat session. Something, by the way, it always previously failed to do causing me to have to explain stuff each and every time before forward progress could be made.

I do agree it did hit the token limit a lot quicker than before where I could chat for hours without worrying about it.

Either way, still have one last yak to shave for this project so we'll see how efficient it is with that. If it accomplishes the task before burning through all the tokens then win, win, I suppose.

rvz•7h ago

Notice how Anthropic has never open sourced any of their models.

This makes them (Anthropic) worse than OpenAI in terms of openness.

Since in this case as we all know. [0]

"What will permanently change everything is open source and transparent AI models that are smaller and more powerful than GPT-3 or even GPT-4."

[0] https://news.ycombinator.com/item?id=34865626

jjani•7h ago

On the other hand, they have always exposed their raw chain of thought, so you know exactly what you're paying for, unlike OpenAI who hides it. Similarly they allow an actual thinking budget rather than vague "low, medium, high", again unlike OpenAI. They also allow API access to all their models without draconic send-us-your-personal-data-KYC, once more unlikely OpenAI.

They might not fit your personal definition of "openness", but they do fit many other equally valid interpretations of that contept.

ryandrake•7h ago

Am I the only one super confused about how to even get started trying out this stuff? Just so I wouldn't be "that critic who doesn't try the stuff he criticizes," I tried GitHub Copilot and was kind of not very impressed. Someone on HN told me Copilot sucks, use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.

Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?) and one is just browser based, I guess. Then you have the different pricing plans, Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the Website, but Claude Code is per-request? Or is it Claude API that's per-request? I have no idea. Then you have the models: Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!

adamors•7h ago

Download Cursor and try it through that, IMO that's currently the most polished experience especially since you can change models on the fly. For more advanced usecases, CLI is better but for getting your feet wet I think Cursor is the best choice.

ryandrake•6h ago

Thanks. Too bad you need to switch editors to go that path. I assume the Cursor monthly plans are not the same as the Claude monthly plans and you can't use one for the other if you want to experiment...

kingnothing•6h ago

Cursor is built on VSCode.

olalonde•7h ago

Claude Code CLI.

ryandrake•7h ago

Thanks. With the CLI, can you get Copilot-ish things like tab-completion and inline commands directly in your IDE? Or do you need to copy/paste to and from a terminal? It feels like running a command on the IDE and then copying the output into your IDE is a pretty primitive way to operate.

cultureulterior•6h ago

Claude does the coding, and edits your files. You just sit back and relax. You don't do any tab completion etc.

avemg•6h ago

My advice is this:

1) Completely separate in your mind the auto-completion features from the agentic coding features. The auto-completion features are a neat trick but I personally find those to be a bit annoying overall, even if they sometimes hit it completely right. If I'm writing the code, I mostly don't want the LLM autocompletion.

2) Pay the $20 to get a month of Claude Pro access and then install Claude Code. Then, either wait until you have a small task in mind or your stuck on some stupid issue that you've been banging your head on and then open your terminal and fire up Claude Code. Explain to it in plain English what you want it to do. Pretend it's a colleague that you're giving a task to over Slack. And then watch it go. It works directly on your source code. There is no copying and pasting code.

3) Bookmark the Claude website. The next time you'd Google something technical, ask it Claude instead. General questions like "how does one typically implement a flizzle using the floppity-do framework"? "I'm trying to accomplish X, what are my options when using this stack?". General questions like that.

From there you'll start to get it and you'll get better at leverage the tool to do what you want. Then you can branch out the rest of the tool ecosystem.

ryandrake•6h ago

Interesting about the auto-completion. That was really the only Copilot feature I found to be useful. The idea of writing out an English prompt and telling Copilot what to write sounded (and still sounds) so slow and clunky. By the time I've articulated what I want it to do, I might as well have written the code myself. The auto-completion was at least a major time-saver.

"The card game state is a structure that contains a Deck of cards, represented by a list of type Card, and a list of Players, each containing a Hand which is also a list of type Card, dealt randomly, round-robin from the Deck object." I could have input the data structure and logic myself in the amount of time it took to describe that.

avemg•6h ago

I think you should embrace a bit of ambiguity. Don't treat this like a stupid computer where you have to specify everything in minute detail. Certainly the more detail you give, the better to an extent. But really: Treat it like you're talking to a colleague and give it a shot. You don't have to get it right on the first prompt. You see what it did and you give it further instructions. Autocomplete is the least compelling feature of all of this.

Also, I don't remember what model Copilot uses by default, especially the free version, but the model absolutely makes a difference. That's why I say to spend the $20. That gives you access to Sonnet 4 which is where, imo, these models took a giant leap forward in terms of quality of output.

ryandrake•6h ago

Thanks, I shall give it a try.

rstupek•4h ago

Is Opus as big a leap as sonnet4 was?

stillpointlab•6h ago

One analogy I have been thinking about lately is GPUs. You might say "The amount of time it takes me to fill memory with the data I want, copy from RAM to the GPU, let the GPU do it's thing, then copy it back to RAM, I might as well have just done the task on the CPU!"

I hope when I state it that way you start to realize the error in your thinking process. You don't send trivial tasks to the GPU because the overhead is too high.

You have to experiment and gain experience with agent coding. Just imagine that there are tasks where the overhead of explaining what to do and reviewing the output are dwarfed by the actual implementation. You have to calibrate yourself so you can recognize those tasks and offload them to the agent.

potatolicious•6h ago

There's a sweet spot in terms of generalization. Yes, painstakingly writing out an object definition in English just so that the LLM can write it out in Java is a poor use of time. You want to give it more general tasks.

But not too general, because then it can get lost in the sauce and do something profoundly wrong.

IMO it's worth the effort to know these tools, because once you have a more intuitive sense for the right level of abstraction it really does help.

So not "make this very basic data structure for me based on my specs", and more like "rewrite this sequential logic into parallel batches", which might take some actual effort but also doesn't require the model to make too many decisions by itself.

It's also pretty good at tests, which tends to be very boilerplate-y, and by default that means you skip some cases, do a lot of brain-melting typing, or copy-and-paste liberally (and suffer the consequences when you missed that one search and replace). The model doesn't tire, and it's a simple enough task that the reliability is high. "Generate test cases for this object, making sure to cover edges cases A, B, and C" is a pretty good ROI in terms of your-time-spent vs. results.

collinvandyck76•7h ago

Claude Code is the superior interface in my opinion. Definitely start there.

Filligree•7h ago

You need Claude Pro or Max. The website subscription also allows you to use the command line tool—the rate limits are shared—and the command line tool includes IDE integration, at least for VSCode.

Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.

wahnfrieden•5h ago

Correct. Claude Code Max with Opus. Don’t even bother with Sonnet.

kelnos•3h ago

I wouldn't be too prescriptive. I have Pro, and it's fine. I'm not an incredibly heavy user (yet?); I've hit the rate limits a couple times, but not to the point where I'm motivated to spend more.

I haven't tried it myself, but I've heard from people that Opus can be slow when using it for coding tasks. I've only been using Sonnet, and it's performed well enough for my purposes.

Filligree•2h ago

Sonnet works fine in many cases. Opus is smarter, and custom 'agents' can be set to use either.

I prefer configuring it to use Sonnet for things that don't require much reasoning/intelligence, with Opus as the coordinator.

wahnfrieden•1h ago

Opus is slow, so sessions should be used in parallel, likely across work trees. You shouldn't sit and wait on an Opus agent.

47282847•1h ago

> You need Claude Pro or Max.

Actually, to try it out, prepaid token billing is fine. You are not required to have a subscription for claude code cli. Even just $5 gave me enough breathing room to get a feeling for its potential, personally. I do not touch code often these days so I was relieved not to have to subscribe and cancel again just to play around a little and have it write some basic scripts for me.

vlade11115•7h ago

Claude Code has two usage modes: pay-per-token or subscription. Both modes are using API under the hood, but with subscription mode you are only paying a fixed amount a month. Each subscription tier has some undisclosed limits, cheaper plans have lower usage limits. So I would recommend paying $20 and trying the Claude Code via that subscription.

dennisy•6h ago

No Opus in the $20 tier though sadly

oblio•6h ago

What does Opus do extra?

lxgr•6h ago

It's a much larger, more capable LLM than Claude Sonnet.

andyferris•2h ago

As far as I can tell - that seems to have changed today!

kace91•6h ago

I’m looking for cursor alternatives after confusing pricing changes. Is Claude code an option? Can be integrated on an editor/ide for similar results?

My use case so far is usually requesting mechanic work I would rather describe than write myself like certain test suites, and sometimes discovery on messy code bases.

andyferris•2h ago

Claude Code is really good for this situation.

If you like an IDE, for example VS Code you can have the terminal open at the bottom and run Claude Code in that. You can put your instructions there and any edits it makes are visibile in the IDE immediately.

Personally I just keep a separate terminal open and have the terminal and VSCode open on two monitors - seems to work OK for me.

prinny_•7h ago

What exactly did you try with GitHub copilot? It’s not an LLM itself, just in interface for an LLM. I have copilot in my professional GitHub account and I can choose between chat-gpt and Claude.

AlecSchueler•6h ago

I'm not sure what's complicated about what you're describing? They offer two models and you can pay more for higher usage limits, then you can choose if you want to run it in your browser or in your terminal. Like what else would you expect?

Fwiw I have a Claude pro plan and have no interest in using other offerings so I'm not sure if they're super simple (one model, one interface, one pricing plan)?

onlyrealcuzzo•6h ago

When people post this stuff, it's like, are you also confused that Nike sells shoes AND shorts AND shirts, and there's different colors and skus for each article of clothing, and sometimes they sell direct to consumer and other times to stores and to universities, and also there's sales and promotions, etc, etc?

It's almost as if companies sell more than one product.

Why is this the top comment on so many threads about tech products?

Imustaskforhelp•6h ago

Because I think that claude has gone beyond tech niche at this point..

Or maybe that's me, but still whether its through the likes of those vibe coding apps like lovable bolt etc.

at the end of the day, Most people are using the same tool which is claude since its mostly superior in coding (questionable now with oss models, but I still use it through kiro).

People expect this stuff to be simple when in reality its not and there is some frustation I suppose.

furyofantares•6h ago

In this case, they tried something and were told they were doing it wrong, and they know there's more than one way to do it wrong - wrong model, wrong tool using the model, wrong prompting, wrong task that you're trying to use it for.

And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.

On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.

AlecSchueler•6h ago

Is it though? People complain about sore feet and hear they wear the wrong kind of shoes so they go to the store where they have to spend money to find out while trying to navigate between dress shoes, minimal shoes, running shoes, hiking shoes etc etc., they have to know their size, ask for assistance in trying them on...

evilduck•6h ago

> I think it's pretty different from buying shoes.

Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.

Are you a construction worker, a banker, a cashier or a driver? Are you walking 5 miles everyday or mostly sedentary? Do you require steel toed shoes? How long are you expecting them to last and what are you willing to pay? Are you going to wear them on long runs or take them river kayaking? Do they need to be water resistant, waterproof or highly breathable? Do you want glued, welted, or stitch down construction? What about flat feet or arch support? Does shoe weight matter? What clothing are you going to wear them with? Are you going to be dancing with them? Do the shoes need a break in period or are they ready to wear? Does the available style match your preferences? What about availability, are you ok having them made to order or do you require something in stock now?

By comparison I can try 10 different AI services without even needing to stand up for a break while I can't buy good dress shoes in the same physical store as a pair of football cleats.

kelnos•4h ago

> Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.

Oh c'mon, now you're just being disingenuous, trying to make an argument for argument's sake.

No, shoe shopping is not more complicated than trialing a LLM. For all of those questions about shoes you are posing, either a) a purchaser won't care and won't need to ask them, or b) they already know they have specific requirements and will know what to ask.

With an LLM, a newbie doesn't even know what they're getting into, let alone what to ask or where to start.

> By comparison I can try 10 different AI services without even needing to stand up for a break

I can't. I have no idea how to do that. It sounds like you've been following the space for a while, and you're letting your knowledge blind you to the idea that many (most?) people don't have your experience.

UncleEntity•27m ago

Just play with the 'free tier' on whatever website does the AI thing and figure it out.

Maybe there's a need to try ten different ones but I just stuck with one and can now convince it to do what I want it to do pretty successfully.

UncleEntity•37m ago

Ya know, in the over half a century I've been on this planet, choosing a new pair of shoes is so low on my 'life's little annoyances' list that it doesn't even rise above the noise of all the stupid random things which actually do annoy me.

Maybe the problem is I don't take shoes seriously enough? Something to work on...

ryandrake•6h ago

Hey, I'm open to the idea that I'm just stupid. But, if people in your target market (software developers) don't even understand your product line and need a HOWTO+glossary to figure it out, maybe there's also a branding/messaging/onboarding problem?

DougBTX•4h ago

My hot take is that your friend should show you what they’re using, not just dismiss Copilot and leave you hanging!

gmueckl•6h ago

When you walk into a store, you can see and touch all of these products. It's intuitive.

With all this LLM cruft all you get is essentially the same old chat interface that's like the year 2000 called and wants its on-line chat websites back. The only thing other than a text box that you usually get is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic sounding options (GPT-something, Claude Whatever...). Of course this confuses people!

derefr•6h ago

Claude.ai, ChatGPT, etc. are finished B2C products. They're black boxes, encapsulated experiences. Consumers don't want to pick a model, or know what model they're using; they just want to "talk to AI", and for the system to know which model is best to answer any given question. I would bet that for these companies, if their frontend observes you using the little model override button, that gets instrumented as an "oops" event in their metrics — something they aim to minimize.

What you're looking for, are the landing pages of the B2B API products underlying these B2C experiences. That would be https://www.anthropic.com/claude, https://openai.com/api/, etc. (In general, search "[AI company] API".)

From those B2B landing pages, you can usually click through to pages with details about each of their models.

Here's the model page corresponding to this news announcement, for example: https://www.anthropic.com/claude/opus

(Also, note how these B2B pages are on the AI companies' own corporate domains; whereas their B2C products have their own dedicated domains. From their perspective, their B2C offerings are essentially treated as separate companies that happen to consume their APIs — a "reference use-case" — rather than as a part of what the B2B company sells.)

margalabargala•6h ago

If anything, Anthropic has the product lineup that makes the most sense. Higher numbers mean better model. Haiku < Sonnet < Opus which translates to length/size. Free < Pro < Max.

Contrast to something like OpenAI. They've got gpt4.1, 4o, and o4. Which of these are newer than one another? How do people remember which of o4 and 4o are which?

hvb2•6h ago

Not sure is this is sarcasm I'm assuming not.

You're comparing well understood products that are wildly different to products with code names. Even someone who has never wore a t-shirt will see it on a mannequin and know where it goes.

I'm sorry but I cannot tell what the difference is between sonnet and opus. Unless one is for music...

So in this case you read the docs. Which is, in your analogy, you going to the Nike store and reading up on if a tshirt goes on your upper or lower body.

potatolicious•6h ago

Eh, this seems like a take that reeks a bit of "everyone is stupid except me".

I do know the answer to OP's question but that's because I pickle my brain in this stuff. It is legitimately confusing.

The analogy to different SKUs strikes me also inaccurate. This isn't the difference between shoes, shirts, and shorts - it's more as if a company sells three t-shirts but you can't really tell what's different about them.

It's Claude, Claude, and Claude. Which ones code for you? Well, actually, all of them (Code, web/desktop Claude, and the API can all do this)

Which ones do you ask about daily sundry queries? Well, two of them (web/desktop Claude, but also the API, but not Code). Well, except if your sundry query is about a programming topic, in which case Code can also do that!

Ok, if I do want to use this to write code, which one should I use? Honestly, any of them, and the company does a poor job of explaining why you would use each option.

"Which of these very similar-seeming t-shirts should I get?" "You knob. How are posts like this even being posted." is just an extremely poor way to approach other people, IMO.

ryandrake•5h ago

> It's Claude, Claude, and Claude. Which ones code for you?

Thanks for articulating the confusion better than I could! I feel it's a similar branding problem as other tech companies have: I'm watching Apple TV+ on my Apple TV software running on my Apple TV connected to my Google TV that isn't actually manufactured by Google. But that Google TV also has an Apple TV app that can play Apple TV+.

potatolicious•5h ago

It's a bit worse than a branding problem honestly, since there's legitimate overlap between products, because ultimately they're different expressions of the same underlying LLMs.

I'm not sure if you ever got a good rundown, but the tl;dr is that the 3 products ("Desktop", Code, and API) all expose the same underlying models, but are given different prompts, tools, and context management techniques that make them behave fairly differently and affect how you interact with them.

- The API is the bare model itself. It has some coding ability because that's inherent to the model - you can ask it to generate code and copy and paste it for example. You normally wouldn't use this except that if you're using some Copilot-type IDE integration where the IDE is doing the work of talking to the model for you and integrating it into your developer experience. In that case you provide API key and the IDE does the heavy lifting.

- The desktop app is actually a half-decent coder. It's capable of producing specific artifacts, distinguishing between multiple "files" it's writing for you, and revisiting previously-written code. "Oh, actually rewrite this in Go." is for example a thing it can totally do. I find it useful for diagnosing issues interactively.

- "Claude Code" is a CLI-only wrapper around the model. Think of it like Anthropic's first-party IDE integration, except there's not an IDE, just the CLI. In this case the integration gives the tool broad powers to actually navigate your filesystem, read specific files, write to specific files, run shell commands like builds and tests, etc. These are all functions that an IDE integration would also give you, but this is done in a Claude-y way.

My personal take is: try Claude Code, since as long as you're halfway comfortable with a CLI it's pretty usable. If you really want a direct IDE integration you can go with the IDE+API key route, though keep in mind that you might end up paying more (Claude Code is all-you-can-eat-with-rate-limits, where API keys will... just keep going).

ryandrake•5h ago

Wow. After 50 replies to what I thought wasn't such a weird question, your rundown is the most enlightening. Thank you very much.

Karrot_Kream•4h ago

FWIW it's probably because a lot of us have been following along and trying these things from the start so the nuances seem more obvious but also I feel that some folks feel your question is a bit "stupid", like "why are you suddenly interested in the frontier of these models? where were you for the last 2 years?"

And to some extent it is like the PC race. Imagine going to work and writing software for whatever devices your company writes software for in whatever toolchain your company uses. Then 2-3 years after the PC race began heating up, asking "Hey I only really write code for whatever devices my employer gives me access to. Now I want to buy one of these new PCs but I don't really understand why I'd choose an Intel over a Motorolla chipset or why I'd prioritize more ROM or more RAM, and I keep hearing about this thing called RISC that's way better than CISC and some of these chips claim to have different addressing modes that are better?"

slackpad•4h ago

Claude Code running in a terminal can connect to your IDE so you can review its proposed changes there. I’ve found this to be a nice drop in way to try it out without having to change your core workflow and tools too much. Check out the /ide command for details.

Karrot_Kream•4h ago

Also when it comes to API integrations, I find some better than others. Copilot has been pretty crummy for me but Zed's Agent Mode seems to be almost as good as Claude Code. I agree with the general take that Claude Code is a good default place to start.

tomrod•5h ago

> Why is this the top comment on so many threads about tech products?

Because you overestimate the difference that the representative person understands.

A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and green-blue shoes add 20 mph to your 100 yard dash sprint.

You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.

ryandrake•5h ago

Also, the green-blue shoes charge per-step, but the blue-green shoes are billed monthly by signing up for BlueGreenPro+ or BlueGreenMax+, each with a hidden step limit but BlueGreenMax+ is the one that gives you access to the Cyan step model which is better; plus the green-blue shoes are only useful when sprinting, but the blue-green shoes can be used in many different events, but only through the Nike blue-green API that only some track&field venues have adopted...

true_religion•5h ago

This is like being told to buy Nike shoes. Then when you proudly display your new cleats, they tell you "no, I meant you should by basketball shoes. The cleats are terrible."

squeaky-clean•5h ago

Which Nike shoe is best for basketball? The Nike Dunk, Air Force 1, Air Jordan, LeBron 20, LeBron XXI Prime 93, Kobe IX elite, Giannis Freak 7, GT Cut, GT Cut 3, GT Cut 3 Turbo, GT Hustle 3, or the KD18?

At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?

AlecSchueler•4h ago

What's the average programmer? Is it someone who likes CLI tools? Or who likes IDE integration? Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.

nawgz•3h ago

> Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.

That's a silly claim to me, we're talking about a completely new environment where you prompt an AI to develop code, and therefore an "average programmer" is unlikely to have any meaningful experience or intuition with this flow. That is exactly what GP is talking about - where does he plug in the AI? What tradeoffs are there to different options?

The other day I had someone judge me for asking this question by dismissively saying "dont say youve still been using ChatGPT and copy/paste", which made me laugh - I don't use AI at all, so who was he looking down on?

kelnos•4h ago

Because the offerings are not simple. Your Nike example is silly; everyone knows what to do with shoes and shorts and shirts, and why they might want (or not want) to buy those particular items from Nike.

But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model of another. It's hard to understand why you might want to do per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.

I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.

pdntspa•4h ago

Because few seem to want to expend the effort to dive in and understand something. Instead they want the details spoonfed to them by marketing or something.

I absolutely loathe this timeline we're stuck in.

windsignaling•4h ago

On the contrary, I'm confused about why you're confused.

This is a well-known and documented phenomenon - the paradox of choice.

I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.

I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.

joshmarlow•6h ago

VSCode has a pretty good Gemini integration - it can pull up a chat window from the side. I like to discuss design changes and small refactorings ("I added this new rpc call in my protobuf file, can you go ahead and stub out the parts of code I need to get this working in these 5 different places?") and it usually does a pretty darn good job of looking at surrounding idioms in each place and doing what I want. But gemini can be kind of slow here.

But I would recommend just starting using Claude in the browser, talk through an idea for a project you have and ask it to build it for you. Go ahead and have a brain storming session before you actually ask it to code - it'll help make sure the model has all of the context. Don't be afraid to overload it with requirements - it's generally pretty good at putting together a coherent plan. If the project is small/fits in a single file - say a one page web app or a complicated data schema + sql queries - then it can usually do a pretty good job in one place. Then just copy+paste the code and run it out of the browser.

This workflow works well for exploring and understanding new topics and technologies.

Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.

Hope this helps!

spaceman_2020•6h ago

Download Claude Code

Create a new directory in your terminal

Open that directory, type in "Claude" to run Claude

Press Shit + Tab to go into planning mode

Tell Claude what you want to build - recommend something simple to start with. Specify the languages, environment, frameworks you want, etc.

Claude will come up with a plan. Modify the plan or break it into smaller chunks if necessary

Once plan is approved, ask it to start coding. It will ask you for permissions and give you the finished code

It really is something when you actually watch it go.

zarzavat•6h ago

Github Copilot and Claude code are not exactly competitors.

Github Copilot is autocomplete, highly useful if you use VS Code, but if you are using e.g. Jetbrains then you have other options. Copilot comes with a bunch of other stuff that I rarely use.

Claude code is project-wide editing, from the CLI.

They complement each other well.

As far as I'm concerned the utility of the AI-focused editors has been diminished by the existence of Claude code, though not entirely made redundant.

fkyoureadthedoc•6h ago

> Github Copilot is autocomplete... comes with a bunch of other stuff that I rarely use.

That bunch of other stuff includes the chat, and more recently "Agent Mode". I find it pretty useful, and the autocomplete near useless.

qingcharles•5h ago

This isn't correct. GitHub Copilot now totally competes with Claude Code. You can have it create an entire app for you in "Agent" mode if you're feeling brave. In fact, seeing as Copilot is built directly into Visual Studio when you download it, I guess they have a one-up.

Copilot isn't locked to a specific LLM, though. You can select the model from a panel, but I don't think you can plug in your own right now, and the ones you can select might not be SOTA because of that.

alienbaby•2h ago

Sonnet 4 in copilot agent mode has been doing great work for me lately. Especially once you realise that at least 50% of the work is done before you get to copilot, as architectural and product specs and implementations plans.

tomwojcik•3h ago

Opencode https://github.com/sst/opencode provides a CC like interface for copilot. It's a slightly worse tool, but since copilot with Claude 4 is super cheap, I ended up preferring it over CC. Almost no limits, cheaper, you can use all the Copilot models, GH is not training on your data.

andsoitis•5h ago

> use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.

Anthropic has this useful quick start guide: https://docs.anthropic.com/en/docs/claude-code/quickstart

StephenHerlihyy•5h ago

Kilo Code for VSCode is pretty solid. Give it a try.

wintermutestwin•5h ago

Yes. You basically need an LLM to provide guidance on product selection in this brave new world.

It is actually one of my most useful use cases of this tech. Nice to have a way to ask in private so you don’t get snarky answers like: it’s just like buying shoes!

vanillax•5h ago

All the tools, copilot,claude, gemini in vscode are all completely worthless unless in Agent Mode. I have no idea why none of these tools dont default to Agent mode.

ActorNightly•5h ago

If you want your own cheap IDE integration, you can set up VSCode with Continue extension, ollama running locally, and a small agent model. https://docs.continue.dev/features/agent/model-setup.

If you want to understand how all of this works, the best way is to build a coding agent manually. Its not that hard

1. Start with Ollama running locally and Gemma3 QAT models. https://ollama.com/library/gemma3

2. Write a wrapper around Ollama using your favorite language. The idea is that you want to be able to intercept responses coming back from the model.

3. Create a system prompt that tells the model things like "if the user is asking you to create a file, reply in this format:...". Generally to start, you can specify instructions for read file, write file, and execute file

4. In your wrapper, when you send the input chat prompt, and get the model response back, you look for those formats, and make the wrapper actually execute the action. For example if the model replies back with the format to read file, you read the file from your wrapper code and send it back to the model.

Every coding assistant is basically this under the hood with just a lot more fluff and their own IDE integration.

The benefit of doing your own is that you can customize it to your own needs, and when you direct a model with more precision even the small models perform very well with much faster speed.

afro88•5h ago

OP is asking for where to get started with Claude for coding. They're confused. They just want to mess around with it in VSCode. And you start talking about Ollama, PAT, coding your own wrapper, composing a system prompt etc.!?

jimbo808•4h ago

You just described all of your options in detail - what's the problem? Pick one. Seems like you've got a very thorough grasp on how to get started trying the stuff out, but it requires you to choose how you want to do that.

kelnos•4h ago

If you're looking for a coding assistant, get Claude Code, and give it a try. I think you need the Pro plan at a minimum for that ($20/mo; I don't think Free includes Claude Code). Don't do the per-request API pricing as it can get expensive even while just playing around.

Agree that the offering is a bit confusing and it's hard to know where to start.

Just FYI: Claude Code is a terminal-based app. You run it in the working directory of your project, and use your regular editor that you're used to, but of course that means there's no editor integration (unlike something like Cursor). I personally like it that way, but YMMV.

robluxus•4h ago

> I just want to putz around with something in VSCode for a few hours!

I just googled "using claude from vscode" and the first page had a link that brought me to anthropic's step by step guide on how to set this up exactly.

Why care about pricing and product names and UI until it's a problem?

> Someone on HN told me Copilot sucks, use Claude.

I concur, but I'm also just a dude saying some stuff on HN :)

zaphirplane•4h ago

try asking it ?

screye•3h ago

Cursor + Claude 4 = best quality + UX balance. Pay up for 20/month subscription.

Cursor imports in your VSCode setup. Setting it up should be trivial.

Use Agent mode. Use it in a preexisting repo.

You're off the races.

There is a lot more you can do, but you should start seeing value at this point.

w0m•3h ago

honestly - copilot free mode; and just play with the agentic stuff can give you a good idea. Attach it to Roo and you'll get a good idea. Realize that if you paid to use a better model; you'd get better results as free doesn't have a ton of premium tokens.

ramesh31•7h ago

Will the price for 4 go down? I still find Opus completely unusable for the cost/performance, as someone who spends thousands per month on tokens. There's really no noticeable difference from Sonnet, at nearly 10x the price.

_vaporwave_•7h ago

It's interesting that Anthropic maintains current prices for prior state of the art models when doing a new release. Why offer a model with worse performance for the same price? What incentives are they trying to create?

dysoco•7h ago

I'm guessing it's mostly for legacy reasons. When 3.7 came out many people were not happy with it and went back to 3.5; I guess supporting older models for a while makes sense.

gwd•4h ago

> What incentives are they trying to create?

One obvious explanation is that pricing is strongly related to the price to them, and that their only incentive is for people to use an expensive model of they really need it.

I forget which one of the GPT models was better, faster, and cheaper than the previous model. The incentive there is obviously, "If you want to use the old model for whatever reason, fine, but we really want you to use the new one because costs us less to run."

mrcwinn•6h ago

o3 and o3-pro are just so good. Sonnet goes off the deep end too often and Opus, in my experience, is not as strong at reasoning compared to OpenAI, despite the higher costs. Rarely do we see a worse, more expensive product win - but competition is good and I’m rooting for Anthropic nonetheless!

AlecSchueler•6h ago

Off the deep end?

UncleEntity•20m ago

Probably referring to it's tendency to over-complicate things to the point you have to step in and be like "WTF are you even talking about... Wouldn't it be a lot simpler to just use the original, well planned out design?"

Which it does a lot...

WXLCKNO•5h ago

o3 feels pretty good to me as well but o3-pro has consistently one shotted problems other LLMs got stuck on.

I'm talking multiple tries of claude 4 opus, Gemini 2.5 pro, o3 etc resulting in sometimes hundreds of lines of code.

Versus o3-pro (very slowly) analyzing and then fixing something that seemed completely unrelated in a one or two line change and truly fixing the root cause.

o3-pro level LLMs at reduced cost and increased speed will already be amazing..

bayesianbot•12m ago

OpenAI also has Flex processing[1] for o3. I've spent most of my time with Gemini 2.5, but lately been trying out a ton of o3 as it seems to work quite well and I get really cheap tokens (~95% of my agentic tokens are cached which is 75% discount and flex mode adds 50% for $0.25 / million input tokens)

[1] https://platform.openai.com/docs/guides/flex-processing?api-...

thoop•6h ago

The article says "We plan to release substantially larger improvements to our models in the coming weeks."

Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.

I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.

mocmoc•6h ago

Their limits are just … a real road blocker

bananapub•5h ago

huh?

Claude Mad is tens of hours of opus a month, or you can pay per token and have unlimited.

Or did you mean “I wish it was cheaper”?

andyferris•1h ago

Ha - the $200 plan should be renamed to "Claude Mad Max" :)

OldGreenYodaGPT•6h ago

Claude Code has honestly made me at least 10x more productive. I’ve burned through about 3 billion tokens and have been consistently merging 5+ PRs a day, tackling tons of tech debt, improving GitHub Actions, and making crazy progress on product work

totaa•6h ago

can you share your workflow?

steinvakt2•6h ago

I also have this feeling that I'm 2-10x more productive. But isn't it curious how a lot of devs feel this way, but no devs that I know have the experience that any of their colleagues have become 2-10x more productive?

nevertoolate•5h ago

10x means to me that i can finish a month of work in max 2 days and go cloud watching. What does it mean for you?

mwigdahl•5h ago

<raises hand> Our automated test folks were chronically behind, struggling to keep up with feature development. I got the two assigned to the team that was the most behind set up with Claude Code. Six weeks later they are fully caught up, expanding coverage, and integrating AI code review into our build pipeline.

It's not 10x, but those guys do seem like they've hit somewhere around 2x improvement overall.

samtp•5h ago

What type of work do you do and what type of code do you produce?

Because I've found it to work pretty amazingly for things that don't need to be exact (like data modeling) or don't have any security implications (public apps). But for everything else I end up having to find all the little bugs by reading the code line by line, which is much slower than just writing the code in the first place.

AstroBen•5h ago

only 10x? I'm at least 100x as productive. I only type at a measly 100wpm, whereas Claude can output 100+ tokens a second

I'm outputting a PR every 6 minutes. The reviewers are using Claude to review everything. It used to take a day to add 100 lines to the codebase.. now I can add 100 lines in one prompt

If I want even more productivity (at risk of making the rest of my team look slow) I can tell Claude to output double the lines and ship it off for review. My performance metrics are incredible

samtp•5h ago

So no human reads the actual code that you push to production? Are you not worried about security risks, spaghetti code, and other issues? Or does Claude magically make all of those concerns go away?

AstroBen•5h ago

forgot the /s

samtp•5h ago

Sorry lol, sometimes difficult to separate the hype boys from actual sarcasm these days

qingcharles•3h ago

Not sure if joking...?

AstroBen•3h ago

This is only the beginning. I can see myself having 100 Claude tasks running concurrently - the only problem is edits clash between files. I'm working on having Claude solve this by giving each instance its own repo to work with, then I ask the final Claude to mash it all together as best it can

What's 100x productivity multiplied by 100 instances of Claude? 10,000x productivity

Now to be fair and a bit more realistic it's not actually 10000x because it takes longer to push the PR because the file sizes are so big. Let's call it 9800x. That's still a sizable improvement

trallnag•3h ago

Big if true

screye•3h ago

How do you maintain high confidence in the code it generates ?

My current bottleneck is having to review the huge amounts of code that these models spit out. I do TDD, use auto-linting and type-checking.... but the model makes insidious changes that are only visible on deep inspection.

theappsecguy•2h ago

The only way you could be 10x more productive is omit you were doing nothing before.

P24L•6h ago

The improved Opus isn’t about achieving significantly better peak performance for me. It’s not about pushing the high end of the spectrum. Instead, it’s about consistently delivering better average results - structuring outputs more effectively, self-correcting mistakes more reliably, and becoming a trustworthy workhorse for everyday tasks.

djha-skin•5h ago

Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, spinning in circles. OpenAI is better, but still falls short of Claude's performance. Claude also gives back 400's from its API if you CTRL-C in the middle though, so that's annoying.

Economics is important. Best bang for the buck seems to be OpenAI ChatGPT 4.1 mini[6]. Does a decent job, doesn't flood my context window with useless tokens like Claude does, API works every time. Gets me out of bad spots. Can get confused, but I've been able to muddle through with it.

1: https://openrouter.ai/anthropic/claude-opus-4.1

2: https://openrouter.ai/anthropic/claude-sonnet-4

3: https://block.github.io/goose/

4: https://openrouter.ai/anthropic/claude-3.5-sonnet

5: https://openrouter.ai/google/gemini-2.5-flash

6: https://openrouter.ai/openai/gpt-4.1-mini

generalizations•5h ago

Get a subscription and use claude code - that's how you get actual reasonable economics out of it. I use claude code all day on the max subscription and maybe twice in the last two weeks have I actually hit usage limits.

tgtweak•5h ago

Is it considerably more cost effective than cline+sonnet api calls with caching and diff edits?

Same context length and throughput limits?

Anecdotally I find gpt4.1 (and mini) were pretty good at those agentic programming tasks but the lack of token caching made the costs blow up with long context.

bavell•4h ago

I'm on the basic $20/mo sub and only ran into token cap limitations in the first few days of using Claude Code (now 2-3 weeks in) before I started being more aggressive about clearing the context. Long contexts will eat up tokens caps quickly when you are having extended back-and-forth conversations with the model. Otherwise, it's been effectively "unlimited" for my own use.

bgirard•3h ago

YMMV I'm using the $100/mo max subscription and I hit the limit during a focused coding session where I'm giving it prompts non-stop.

Unfortunately there's no easy tool to inspect usage. I started a project to parse the Claude logs using Claude and generate a Chrome trace with it. It's promising but it was taking my tokens away from my core project.

bartman•2h ago

Check out ccusage, it sounds like the tool you’re describing: https://github.com/ryoppippi/ccusage

bgirard•2h ago

That's neat. According to the tool I'm consuming ~300m tokens per day coding with a (retail?) cost of ~125$/day. The output of the model is definitely worth $100/mo to me.

symbolicAGI•2h ago

ccusage on GitHub.

MarcelOlsz•1h ago

If you use Claude Code with a subscription and run `ccusage` [0] you can get an idea of your "true usage" and cost.

[0] https://github.com/ryoppippi/ccusage

j45•2m ago

Yes, it’s much better.

It uses way less tokens or much more effectively when running locally.

seneca•3h ago

Is there a way to sign up for Claude code that doesn't involve verifying a phone number with Anthropic? They don't even accept Google Voice numbers.

Maybe I'm out of touch, but I'm not handing out my phone number to sign up for random SaaS tools.

tagami•3h ago

use a burner

kroaton•4h ago

GLM 4.5 / Kimi K2 / Qwen Coder 3 / Gemini Pro 2.5

Aeolun•1h ago

In every price comparison I make. Claude (API) always comes out cheapest if you manage to keep most of your context cached. 90% price reduction for input is crazy.

energy123•43m ago

Large models are for querying the model

Small models are for querying the context

Opus is cheap if you use it for its niche

thimabi•2m ago

> Large models are for querying the model > Small models are for querying the context

I respectfully disagree.

My experience is that large models are capable of understanding large contexts much better. Of course they are more expensive and slower, too. But in terms of accuracy, large models are always better at querying the context.

paul7986•5h ago

Claude plus failed me today badly compared to chatGPT plus.

I uploaded a web design of mine (jpeg) and asked Claude to create the html/css. Asked GPT to do the same. GPT's code looked the closet to the design I created and uploaded. Just five to ten small tweaks and I was done vs. Claude it would have taken me almost triple the steps.

I actually subscribed to both today (resubscribed to GPT) and going to keep testing which one is the better front-end developer (i am, but got to embrace AI ).

alvis•4h ago

Funny Open AI and Anthropic seems to be coordinating their releases on the same day

KaoruAoiShiho•4h ago

For me this is the big news of the day. Looks insane.

hartator•4h ago

> 1 min read

What the point of these?

Kind of interesting that we live in an area of AI super advanced, but still make basic UI/UX mistake. The tagline of this blog post shouldn't be "1 min read".

It's not even accurate. I timed myself not reading fast but not slow, took me 3 min 30s. Maybe the images need be OCRed to make the estimation more accurate.

TimMeade•3h ago

This has been the worse Claude day ever. Just fell apart. Not sure if the release is why, but cursing in documents and can not fix a bug after hours of back and forth.

system2•9m ago

Claude lost me after I used it for a day. Their pricing model is bonkers. There is no way any developer in their right mind would go with Claude.

Open models by OpenAI

Genie 3: A new frontier for world models

Spotting base64 encoded JSON, certificates, and private keys

Ollama Turbo

Create personal illustrated storybooks in the Gemini app

Consider using Zstandard and/or LZ4 instead of Deflate

Claude Opus 4.1

Things that helped me get out of the AI 10x engineer imposter syndrome

Scientific fraud has become an 'industry,' analysis finds

What's wrong with the JSON gem API?

The First Widespread Cure for HIV Could Be in Children

Ask HN: Have you ever regretted open-sourcing something?

uBlock Origin Lite now available for Safari

Kyber (YC W23) is hiring enterprise account executives

Show HN: Stagewise (YC S25) – Front end coding agent for existing codebases

Build Your Own Lisp

US reportedly forcing TSMC to buy 49% stake in Intel to secure tariff relief

Quantum machine learning via vector embeddings

Los Alamos is capturing images of explosions at 7 millionths of a second

Under the Hood of AFD.sys Part 1: Investigating Undocumented Interfaces

The mystery of Winston Churchill's dead platypus was finally solved

Cannibal Modernity: Oswald de Andrade's Manifesto Antropófago (1928)

AI is propping up the US economy

No Comment (2010)

Tell HN: Anthropic expires paid credits after a year

Cow vs. Water Buffalo Mozzarella

Eleven Music

Apache ECharts 6

GitHub pull requests were down

Using Dspy to Detect Document Boundaries

Open models by OpenAI

Genie 3: A new frontier for world models

Spotting base64 encoded JSON, certificates, and private keys

Ollama Turbo

Create personal illustrated storybooks in the Gemini app

Consider using Zstandard and/or LZ4 instead of Deflate

Claude Opus 4.1

Things that helped me get out of the AI 10x engineer imposter syndrome

Scientific fraud has become an 'industry,' analysis finds

What's wrong with the JSON gem API?

The First Widespread Cure for HIV Could Be in Children

Ask HN: Have you ever regretted open-sourcing something?

uBlock Origin Lite now available for Safari

Kyber (YC W23) is hiring enterprise account executives

Show HN: Stagewise (YC S25) – Front end coding agent for existing codebases

Build Your Own Lisp

US reportedly forcing TSMC to buy 49% stake in Intel to secure tariff relief

Quantum machine learning via vector embeddings

Los Alamos is capturing images of explosions at 7 millionths of a second

Under the Hood of AFD.sys Part 1: Investigating Undocumented Interfaces

The mystery of Winston Churchill's dead platypus was finally solved

Cannibal Modernity: Oswald de Andrade's Manifesto Antropófago (1928)

AI is propping up the US economy

No Comment (2010)

Tell HN: Anthropic expires paid credits after a year

Cow vs. Water Buffalo Mozzarella

Eleven Music

Apache ECharts 6

GitHub pull requests were down

Using Dspy to Detect Document Boundaries

Claude Opus 4.1

Comments