I’m really trying not to be annoyed by Claude’s “You’re absolutely right” because I know I cannot control it, but it is becoming an increasingly difficult task.
an intern never says that. they say "oh, I see."
Answer concisely when appropriate, more extensively when necessary. Avoid rhetorical flourishes, bonhomie, and (above all) cliches. Take a forward-thinking view. OK to be mildly positive and encouraging but NEVER sycophantic or cloying. Above all, NEVER use the phrase "You're absolutely right." Rather than "Let me know if..." style continuations, list a set of prompts to explore further topics.
That last bit causes some clutter at the end of each response, not sure if I'm going to keep it. But it does do a good job of following these guidelines in my experience. The same basic instructions also work well in ChatGPT and Gemini. Does Claude Code not support anything like this?
I find it does decently, but it's far from perfect. E.g. before hooks I had "ALWAYS format and lint" style entries in the memory file, and it probably had a 70% success rate. Often it would go off on little side paths cleaning work up and forget to run lints afterward. Formatting was my biggest gripe.
Deterministic wrappers have been the biggest gain for me personally. I suspect they'll get a lot better over time too. E.g. I want to try to find a way to express more of my personal style guide in a linter so that Claude can't break the various conventions I prefer. But of course, that can be difficult.
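To illustrate the deterministic-wrapper idea, here's a minimal sketch of a post-edit hook script that always formats and lints whatever file the agent just touched. It assumes a Claude Code-style hook that pipes a JSON payload to the script on stdin with the edited file's path under tool_input.file_path, and it uses ruff as the formatter/linter; check your tool's hook docs before relying on the exact field names or exit-code semantics.

    #!/usr/bin/env python3
    # Post-edit hook sketch: deterministically format and lint the file the
    # agent just modified. Field names are assumptions based on the documented
    # hook payload; verify against your version before using.
    import json
    import subprocess
    import sys

    payload = json.load(sys.stdin)
    file_path = payload.get("tool_input", {}).get("file_path", "")

    if file_path.endswith(".py"):
        # Always format, then lint with autofix, regardless of what the agent did.
        subprocess.run(["ruff", "format", file_path], check=False)
        result = subprocess.run(["ruff", "check", "--fix", file_path],
                                capture_output=True, text=True)
        if result.returncode != 0:
            # Surface remaining lint errors so the agent can address them.
            print(result.stdout + result.stderr, file=sys.stderr)
            sys.exit(2)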
For reference my Claude usage was mostly Sonnet, but with consulting from Opus.
GPT5, Sonnet 4, and Gemini Pro 2.5 are all 1x. Opus is 10x, for comparison.
https://docs.github.com/en/copilot/reference/ai-models/suppo...
Also worth keeping in mind that Copilot has reduced context windows even for the premium models, which has a very real impact on agentic performance.
I use Copilot because work is paying for it and it can be made usable, but requires being really deliberate about managing context to keep things on the rails. It's nice that it gives you access to a pretty decent selection of models, though.
At home, I'm mostly using the $100 Claude plan. It's definitely not cheap, but I've found it has a pretty decent balance for my casual experiments with agentic coding.
Another option to seriously consider is setting up an account with OpenRouter and just tossing some cash into your bucket on occasion. OpenRouter lets you arbitrarily make API requests to pretty much any model you want. I've been occasionally tossing $10 or so into mine and I'll use it when I've hit my usage limits with Claude or if I want to see how another model will attack a particular task.
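If it helps, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so switching models is mostly a matter of changing the model string. A minimal sketch (the model slug and prompt here are just examples):

    import os
    import requests

    # OpenRouter speaks the OpenAI chat-completions format; pick any supported
    # model via the "model" field. The slug below is illustrative.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "anthropic/claude-sonnet-4",  # or an OpenAI/Gemini slug
            "messages": [{"role": "user", "content": "Review this diff: ..."}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])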
FWIW, I use Roo code for all of this, so it's pretty easy for me to switch between models/providers as I need to.
Unlike some other workarounds, this is a fully supported workflow and doesn't break Copilot's terms of service with reasonable personal usage. (As far as I understand, at least: Copilot has full visibility into which tools are using it to make chat requests, so nothing is disguising itself as or impersonating Copilot itself. When first setting it up there's a native VS Code approval prompt to allow tool access to Copilot, and the LM API is publicly documented.)
But anything unlimited in the LLM space feels like it's on borrowed time, especially with 3rd-party tool support, so I wouldn't be surprised if they impose stricter quotas on the LM API in the future or drop the unlimited option entirely.
I think this is the whole reason not to compare it to Opus...
> Our smartest, fastest, and most useful model yet
I'd say it's definitely supposed to be the best, it just doesn't deliver.
> I'd say it's definitely supposed to be the best, it just doesn't deliver.
What part of "Our" is difficult to understand in that statement? Or are you claiming that OpenAI owns another model that is clearly better than GPT-5?
I would suggest reading the entire comment thread before attacking people.
I believe Opus starts at $20 a month, similar to GPT5 if you want more than just cursory usage.
Or am I missing something?
Claude Opus 4.1 (most intelligent model for complex tasks):
- Input: $15 / MTok
- Output: $75 / MTok
- Prompt caching: write $18.75 / MTok, read $1.50 / MTok
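To make those numbers concrete, a rough back-of-the-envelope for a single agentic request at Opus 4.1 API prices (the token counts are invented for illustration):

    # Hypothetical request: 80K input tokens, 60K of them served from the
    # prompt cache, plus 4K output tokens. Prices are per million tokens.
    INPUT, OUTPUT, CACHE_READ = 15.00, 75.00, 1.50

    fresh_input = 20_000 * INPUT / 1e6        # $0.30
    cached_input = 60_000 * CACHE_READ / 1e6  # $0.09
    output = 4_000 * OUTPUT / 1e6             # $0.30
    print(f"~${fresh_input + cached_input + output:.2f} per request")  # ~$0.69

Multiply that by the hundreds of tool calls an agentic session can burn through and it's clear why the flat-rate plans keep coming up in these comparisons.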
It would be useful to be able to easily compare what it costs across the big providers: Gemini, Grok, Claude, ChatGPT.
If you want to use Opus in claude code, you've got to get the $100/month plan - or pay API prices. And agentic coding uses a lot of tokens.
Also keep in mind that many employees are not paying out of pocket for LLM use at work. A $1,000 monthly bill for LLM usage is high for an individual but not so much for a company that employs engineers.
They're impressive despite that. But if Sonnet is $20/month and I have to intervene every 3 minutes, while Opus is $100/month and I have to intervene every 5 minutes? ¯\_(ツ)_/¯
Inverting the problem, one might ask how best to spend (say) $5,000 monthly on coding agents. I don't know the answer to that.
So do engineers.
The difference is that IRL engineers know a lot about the context of the business, features, product, ux, stakeholders, expectations, etc, etc which means that the hand-holding is a long running process.
LLMs need all of these things to be clearly written down and specified in one shot.
In one instance, I asked it to optimize a roughly 80 line C# method that matches some object positions by object ID and delta encodes their positions from the previous frame. It seemed to be confused about how all this should work and output completely wrong code. It has all the context it needs in the file and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
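For a sense of the task: the original method was C#, but the underlying idea looks roughly like this Python sketch (names and data layout are invented, not the code from that project):

    def delta_encode(current, previous):
        """Match objects by ID and encode each position as an offset from the
        previous frame; objects without a previous match are sent absolute."""
        encoded = {}
        for obj_id, pos in current.items():
            prev = previous.get(obj_id)
            if prev is None:
                encoded[obj_id] = ("absolute", pos)
            else:
                encoded[obj_id] = ("delta", tuple(c - p for c, p in zip(pos, prev)))
        return encoded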
I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.
If you have a quick prompt you'd like me to try, I can share the results.
But they definitely don't take into account whatever prompts the tools are really using (or whether MS is using a neutered version to reduce cost). So I would agree with the suggestion: using Sonnet through Copilot seems very, very different from Cursor or Cline or Claude Code.
Using the exact same model, Copilot consistently fails to finish tasks or makes a mess. It's consistent at this across IDEs (i.e. the JetBrains plugin produces nearly identical bad results to VS Code Copilot). I then discard everything it did and try the exact same (user) prompt in Cursor or Claude Code or Cline with the same model, and it does the task perfectly.
However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.
I spend a lot of time planning tasks, generating various documents per pr (requirements, questions, todo), having AI poke my ideas (business/product/ux/code-wise) etc.
After 45 minutes of back and forth in general we end up with a detailed plan.
This also has many benefits:
- writing tests becomes very simple (unit, integration, E2Es)
- writing documentation becomes very simple
- writing meaningful PRs becomes very simple
It is quite boring though, not gonna lie. But that's a price I have accepted for quality.
Also, clearing up the ideas so much beforehand often leads me to come up with creative ideas later in the day, when I go for walks and mentally review what we've done and how.
People tend to hate Claude Code because it's not vibe coding anymore but it was never really meant to be.
Claude is trained for Claude Code and that's how it's used in the field too.
Personally I think the attempts to combine LLM coding with current IDE UIs, a la Cursor/Windsurf/VS Code is probably the wrong way to go, it feels too awkward and cumbersome. I like a more interactive interface, and Claude Code is more in line with that.
Watching the ChatGPT 5 demo yesterday, I noticed most of the code seemed oriented towards one-off scripts rather than maintainable codebases which limits its value for me.
Does anyone know if ChatGPT 5 or Copilot have similar extensibility to enforce practices like TDD?
For context on the approach: https://github.com/nizos/tdd-guard
I use pre/post operation commands to enforce TDD rules.
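The repo has the real implementation; purely to illustrate the shape of a pre-operation check, a deliberately simplified sketch might refuse edits to implementation files while the test suite is green (this is not tdd-guard's actual logic, and the payload field names are assumptions):

    #!/usr/bin/env python3
    # Pre-edit hook sketch enforcing a red-green rhythm: block implementation
    # edits unless there is currently a failing test. Illustration only.
    import json
    import subprocess
    import sys

    payload = json.load(sys.stdin)
    path = payload.get("tool_input", {}).get("file_path", "")

    if path and "test" not in path.lower():
        tests = subprocess.run(["pytest", "-q", "--maxfail=1"], capture_output=True)
        if tests.returncode == 0:
            print("Tests are green: write a failing test before touching "
                  "implementation code.", file=sys.stderr)
            sys.exit(2)  # non-zero exit blocks the operation in this setup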
You don't happen to have a short video where you go into a bit more detail on how you use it though?
I spent my summer holiday on this because I truly believe in the potential of hooks in agentic coding. I'm equally surprised that this space hasn't been explored more.
I'm currently working on making the validation faster and more customizable, plus adding reporters to support more languages.
I think there's an Amazon-backed VS Code fork that is also exploring this space. I think they market it as spec-driven development.
Edit: I found it, it's called Kiro: https://kiro.dev/
I don't have a detailed video beyond the short demo on the repo, but I'll look into recording something more comprehensive or cover it in a blog post. Happy to ping you when it's ready!
In the meantime: I simply set it up and go about my work. The only thing I really do is just nudge the agent into making architectural simplifications and make sure that it follows the testing strategies that I like: dependency injection, test helpers, test data factories and such. Things that I would do regardless of the hook.
I like to give my tests the same attention and care that I give production code. They should be meaningful and resilient. The code base contains plenty of examples but I will look into putting something together.
It was a game involving OOP and three.js. I think both are probably great at good design and CRUD things.
It would have eventually finished?
I wish you could be a bit more specific, though: you can't set in detail which commands you want to auto-accept.
Sounds like Claude muddles. I consider that the stronger tactic.
I sure hope GPT-5 is muddling on the backend, else I suspect it will be very brittle.
Re: https://contraptions.venkateshrao.com/p/massed-muddler-intel...
> Lindblom’s paper identifies two patterns of agentic behavior, “root” (or rational-comprehensive) and “branch” (or successive limited comparisons), and argues that in complicated messy circumstances requiring coordinated action at scale, the way actually effective humans operate is the branch method, which looks like “muddling through” but gradually gets there, where the root ["godding through"] method fails entirely.
For home projects, I wish I could have GPT-5 plugged into Claude Code's CLI interface. Iteration just works! Looking forward to less babysitting in the future!
I haven't tried Codex CLI recently; I think it just got an update. That would be another one to investigate.
I've been meaning to give avante.nvim[2] a try since it aims to provide a "Cursor like" experience, but for now I've been alternating between Code Companion for simple prompts and Claude CLI (in a tmux pane next to Neovim) for agentic stuff.
[0] https://codecompanion.olimorris.dev/
At the moment it feels like most people "reviewing" models are led by their beliefs and agendas, and there are no objective ways to evaluate and compare models (many benchmarks can be gamed).
The blurring of the boundaries between technical overviews, news, opinion, and marketing is truly concerning.
Can't help but laugh at this. It's like you just discovered skepticism and how the world actually works.
Just pick something and use it. AI models are interchangeable. It's not as big a decision as buying a car or even a durian.
I heavily discount same day commentary, there's a quid pro quo on early access vs favorable reviews (and yes, folks publishing early commentary aren't explicitly agreeing to write favorable things, but there's obvious bias baked in).
I don't think it's all particularly concerning; you can discount reviews that are coming out so quickly that it's unlikely the reviewer has really used it very much.
I think you’ll always have some disagreement generally in life, but especially for things like this. Code has a level of subjectivity. Good variable names, correct amount of abstraction, verbosity, over complexity, etc are at least partially opinions. That makes benchmarking something subjective tough. Furthermore, LLMs aren’t deterministic, and sometimes you just get a bad seed in the RNG.
Not only that, but the harness and prompt used to guide the model make a difference. Claude responds to the word “ultrathink”, but if GPT-5 uses “think harder”, then what should be in the prompt?
Anecdotally, I’ve had the best luck with agentic coding when using Claude Code with Sonnet. Better than Sonnet with other tools, and better than Claude Code with other models. But I mostly use Go and Dart and I aggressively manage the context. I’ve found GPTs can’t write Zig at all, but Gemini can, and they can both write Python excellently. All that said, if I didn’t like an answer I’d prompt again, but if I liked the answer I never tried again with a different model to see if I’d like it even more. So it’s hard to know what could’ve been.
I’ve used a ton of models and harnesses. Cursor is good too, and I’ve been impressed with more models in cursor. I don’t get the hype of Qwen though because I’ve found it makes lots of small(er) changes in a loop, and that’s noisy and expensive. Gemini is also very smart but worse at following my instructions, but I never took the time to experiment with prompting.
You are not going to get the same output from GPT5 or Sonnet every time.
And this obviously compounds across many different steps.
E.g. give GPT5 the code for a feature (by pointing at some files and tests) and tell it to review it, find improvement opportunities, and write them down: depending on the size of the code, etc., the answers will be slightly different.
I often do it in Cursor by having multiple agents review a PR, and each of them:
- has to write down their pr-number-review-model.md (e.g. pr-15-review-sonnet4.md)
- has to review the reviews of the other files
Then I review it myself and try to decide what's valuable in there and what's not. And to my own disappointment:
- they often do point to valid flaws I wouldn't have thought about
- they miss the "end-to-end" or general view of how the code fits into a program/process/business. What I mean is: sometimes the real feedback would be that we don't need the change at all. But you need to have those conversations with the AI earlier.
This makes logical sense: you don't want a model to get creative if you need functioning code, but if you want a story idea it should basically be all hallucination.
I think it makes sense to have different models for these tasks.
If Sonnet is more expensive AND more chatty/requires more attempts for the same result, it seems like that would favor GPT5 as the daily driver.
"Agenticness" depends so much on the specific tooling (harness) and system prompts. It mentions Copilot - did it use this for both? Given it's created by Microsoft there's good reason to believe it'd be built yo do especially well with GPT (they'll have had 5 available in preview for months by now). Or it could be the opposite and be tuned towards Sonnet. At the very minimum you'd need to try a few different harnesses, preferably ones not closely related to either OpenAI/MS or Anthropic.
This article even mentions things like "Sonnet is much faster" which is very dependent on the specific load at the time of usage. Today everyone is testing GPT-5 so it's slow and Sonnet is much faster.
Also regarding "Sonnet is faster" I did explicitly mention that I believe this is because GPT-5 is in preview and hours from the release. The speed I experienced doesn't say anything about the model performance you can expect.
Everyone wants to know the answer to GPT5 vs Claude without wasting the tokens personally because we can all more or less guess what the result will be.
Well - I would have been interested in GPT-5 vs. Opus. Claude Code Max is affordable with Opus.
Because Anthropic is presumably massively subsidizing the usage.
The training and research are very expensive. The fixed-price subscriptions are 100% a sweetheart deal.