I’m really trying not to be annoyed by Claude’s “You’re absolutely right” because I know I cannot control it, but it is becoming an increasingly difficult task.
an intern never says that. they say "oh, I see."
Answer concisely when appropriate, more extensively when necessary. Avoid rhetorical flourishes, bonhomie, and (above all) cliches. Take a forward-thinking view. OK to be mildly positive and encouraging but NEVER sycophantic or cloying. Above all, NEVER use the phrase "You're absolutely right." Rather than "Let me know if..." style continuations, list a set of prompts to explore further topics.
That last bit causes some clutter at the end of each response, not sure if I'm going to keep it. But it does do a good job of following these guidelines in my experience. The same basic instructions also work well in ChatGPT and Gemini. Does Claude Code not support anything like this?
I find it does decently, but it's far from perfect. E.g. before hooks I had "ALWAYS format and lint" style entries in the memory file, and it probably had a 70% success rate. Often it would go off on little side paths cleaning work up and forget to run lints afterward. Formatting was my biggest gripe.
Deterministic wrappers have been the biggest gain for me personally. I suspect they'll get a lot better over time too. E.g. I want to try to find a way to express more of my personal style guide in a linter so that Claude can't break the various conventions I prefer. But of course, that can be difficult.
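To illustrate the deterministic-wrapper idea, here's a minimal sketch of a post-edit hook script that always formats and lints whatever file the agent just touched. It assumes a Claude Code-style hook that pipes a JSON payload to the script on stdin with the edited file's path under tool_input.file_path, and it uses ruff as the formatter/linter; check your tool's hook docs before relying on the exact field names or exit-code semantics.

    #!/usr/bin/env python3
    # Post-edit hook sketch: deterministically format and lint the file the
    # agent just modified. Field names are assumptions based on the documented
    # hook payload; verify against your version before using.
    import json
    import subprocess
    import sys

    payload = json.load(sys.stdin)
    file_path = payload.get("tool_input", {}).get("file_path", "")

    if file_path.endswith(".py"):
        # Always format, then lint with autofix, regardless of what the agent did.
        subprocess.run(["ruff", "format", file_path], check=False)
        result = subprocess.run(["ruff", "check", "--fix", file_path],
                                capture_output=True, text=True)
        if result.returncode != 0:
            # Surface remaining lint errors so the agent can address them.
            print(result.stdout + result.stderr, file=sys.stderr)
            sys.exit(2)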
For reference my Claude usage was mostly Sonnet, but with consulting from Opus.
GPT5, Sonnet 4, and Gemini Pro 2.5 are all 1x. Opus is 10x, for comparison.
https://docs.github.com/en/copilot/reference/ai-models/suppo...
Also worth keeping in mind that Copilot has reduced context windows even for the premium models, which has a very real impact on agentic performance.
I use Copilot because work is paying for it and it can be made usable, but requires being really deliberate about managing context to keep things on the rails. It's nice that it gives you access to a pretty decent selection of models, though.
At home, I'm mostly using the $100 Claude plan. It's definitely not cheap, but I've found it has a pretty decent balance for my casual experiments with agentic coding.
Another option to seriously consider is setting up an account with OpenRouter and just tossing some cash into your bucket on occasion. OpenRouter lets you arbitrarily make API requests to pretty much any model you want. I've been occasionally tossing $10 or so into mine and I'll use it when I've hit my usage limits with Claude or if I want to see how another model will attack a particular task.
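If it helps, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so switching models is mostly a matter of changing the model string. A minimal sketch (the model slug and prompt here are just examples):

    import os
    import requests

    # OpenRouter speaks the OpenAI chat-completions format; pick any supported
    # model via the "model" field. The slug below is illustrative.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "anthropic/claude-sonnet-4",  # or an OpenAI/Gemini slug
            "messages": [{"role": "user", "content": "Review this diff: ..."}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])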
FWIW, I use Roo code for all of this, so it's pretty easy for me to switch between models/providers as I need to.
Unlike some other workarounds, this is a fully supported workflow and doesn't break Copilot's terms of service with reasonable personal usage. (As far as I understand, at least: Copilot has full visibility into which tools are using it to make chat requests, so nothing is disguising itself as or impersonating Copilot itself. When first setting it up there's a native VS Code approval prompt to allow tool access to Copilot, and the LM API is publicly documented.)
But anything unlimited in the LLM space feels like it's on borrowed time, especially with 3rd-party tool support, so I wouldn't be surprised if they impose stricter quotas on the LM API in the future or drop the unlimited option entirely.
I think this is the whole reason not to compare it to Opus...
> Our smartest, fastest, and most useful model yet
I'd say it's definitely supposed to be the best, it just doesn't deliver.
> I'd say it's definitely supposed to be the best, it just doesn't deliver.
What part of "Our" is difficult to understand in that statement? Or are you claiming that OpenAI owns another model that is clearly better than GPT-5?
I would suggest reading the entire comment thread before attacking people.
I believe Opus starts at $20 a month, similar to GPT5 if you want more than just cursory usage.
Or am I missing something?
Claude Opus 4.1 (most intelligent model for complex tasks):
- Input: $15 / MTok
- Output: $75 / MTok
- Prompt caching: write $18.75 / MTok, read $1.50 / MTok
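To make those numbers concrete, a rough back-of-the-envelope for a single agentic request at Opus 4.1 API prices (the token counts are invented for illustration):

    # Hypothetical request: 80K input tokens, 60K of them served from the
    # prompt cache, plus 4K output tokens. Prices are per million tokens.
    INPUT, OUTPUT, CACHE_READ = 15.00, 75.00, 1.50

    fresh_input = 20_000 * INPUT / 1e6        # $0.30
    cached_input = 60_000 * CACHE_READ / 1e6  # $0.09
    output = 4_000 * OUTPUT / 1e6             # $0.30
    print(f"~${fresh_input + cached_input + output:.2f} per request")  # ~$0.69

Multiply that by the hundreds of tool calls an agentic session can burn through and it's clear why the flat-rate plans keep coming up in these comparisons.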
It would be useful to be able to easily compare what it costs across the big providers: Gemini, Grok, Claude, ChatGPT.
If you want to use Opus in claude code, you've got to get the $100/month plan - or pay API prices. And agentic coding uses a lot of tokens.
Also keep in mind that many employees are not paying out of pocket for LLM use at work. A $1,000 monthly bill for LLM usage is high for an individual but not so much for a company that employs engineers.
They're impressive despite that. But if Sonnet is $20/month and I have to intervene every 3 minutes, while Opus is $100/month and I have to intervene every 5 minutes? ¯\_(ツ)_/¯
Inverting the problem, one might ask how best to spend (say) $5,000 monthly on coding agents. I don't know the answer to that.
So do engineers.
The difference is that IRL engineers know a lot about the context of the business, features, product, ux, stakeholders, expectations, etc, etc which means that the hand-holding is a long running process.
LLMs need all of these things to be clearly written down and specified in one shot.
In one instance, I asked it to optimize a roughly 80 line C# method that matches some object positions by object ID and delta encodes their positions from the previous frame. It seemed to be confused about how all this should work and output completely wrong code. It has all the context it needs in the file and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
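For a sense of the task: the original method was C#, but the underlying idea looks roughly like this Python sketch (names and data layout are invented, not the code from that project):

    def delta_encode(current, previous):
        """Match objects by ID and encode each position as an offset from the
        previous frame; objects without a previous match are sent absolute."""
        encoded = {}
        for obj_id, pos in current.items():
            prev = previous.get(obj_id)
            if prev is None:
                encoded[obj_id] = ("absolute", pos)
            else:
                encoded[obj_id] = ("delta", tuple(c - p for c, p in zip(pos, prev)))
        return encoded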
I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.
If you have a quick prompt you'd like me to try, I can share the results.
But they definitely don't take into account whatever prompts the tools are really using (or whether MS is using a neutered version to reduce cost). So I would agree with the suggestion: using Sonnet through Copilot seems very, very different from Cursor or Cline or Claude Code.
Using the exact same model, Copilot consistently fails to finish tasks or makes a mess. It's consistent at this across IDEs (i.e. the JetBrains plugin produces nearly identical bad results to VS Code Copilot). I then discard everything it did and try the exact same (user) prompt in Cursor or Claude Code or Cline with the same model, and it does the task perfectly.
However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.
I spend a lot of time planning tasks, generating various documents per pr (requirements, questions, todo), having AI poke my ideas (business/product/ux/code-wise) etc.
After 45 minutes of back and forth in general we end up with a detailed plan.
This also has many benefits:
- writing tests becomes very simple (unit, integration, E2Es)
- writing documentation becomes very simple
- writing meaningful PRs becomes very simple
It is quite boring though, not gonna lie. But that's a price I have accepted for quality.
Also, clearing up the ideas so much beforehand often leads me to come up with creative ideas later in the day, when I go for walks and mentally review what we've done and how.
People tend to hate Claude Code because it's not vibe coding anymore but it was never really meant to be.
Claude is trained for Claude Code and that's how it's used in the field too.
Personally I think the attempts to combine LLM coding with current IDE UIs, a la Cursor/Windsurf/VS Code is probably the wrong way to go, it feels too awkward and cumbersome. I like a more interactive interface, and Claude Code is more in line with that.
Watching the ChatGPT 5 demo yesterday, I noticed most of the code seemed oriented towards one-off scripts rather than maintainable codebases which limits its value for me.
Does anyone know if ChatGPT 5 or Copilot have similar extensibility to enforce practices like TDD?
For context on the approach: https://github.com/nizos/tdd-guard
I use pre/post operation commands to enforce TDD rules.
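The repo has the real implementation; purely to illustrate the shape of a pre-operation check, a deliberately simplified sketch might refuse edits to implementation files while the test suite is green (this is not tdd-guard's actual logic, and the payload field names are assumptions):

    #!/usr/bin/env python3
    # Pre-edit hook sketch enforcing a red-green rhythm: block implementation
    # edits unless there is currently a failing test. Illustration only.
    import json
    import subprocess
    import sys

    payload = json.load(sys.stdin)
    path = payload.get("tool_input", {}).get("file_path", "")

    if path and "test" not in path.lower():
        tests = subprocess.run(["pytest", "-q", "--maxfail=1"], capture_output=True)
        if tests.returncode == 0:
            print("Tests are green: write a failing test before touching "
                  "implementation code.", file=sys.stderr)
            sys.exit(2)  # non-zero exit blocks the operation in this setup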
You don't happen to have a short video where you go into a bit more detail on how you use it though?
I spent my summer holiday on this because I truly believe in the potential of hooks in agentic coding. I'm equally surprised that this space hasn't been explored more.
I'm currently working on making the validation faster and more customizable, plus adding reporters to support more languages.
I think there's an Amazon-backed VS Code fork that is also exploring this space. I think they market it as spec-driven development.
Edit: I found it, it's called Kiro: https://kiro.dev/
I don't have a detailed video beyond the short demo on the repo, but I'll look into recording something more comprehensive or cover it in a blog post. Happy to ping you when it's ready!
In the meantime: I simply set it up and go about my work. The only thing I really do is just nudge the agent into making architectural simplifications and make sure that it follows the testing strategies that I like: dependency injection, test helpers, test data factories and such. Things that I would do regardless of the hook.
I like to give my tests the same attention and care that I give production code. They should be meaningful and resilient. The code base contains plenty of examples but I will look into putting something together.
It was a game involving OOP and three.js. I think both are probably great at good design and CRUD things.
It would have eventually finished?
I wish you could be a bit more specific, though: you can't set in detail which commands you want to auto-accept.
Sounds like Claude muddles. I consider that the stronger tactic.
I sure hope GPT-5 is muddling on the backend, else I suspect it will be very brittle.
Re: https://contraptions.venkateshrao.com/p/massed-muddler-intel...
> Lindblom’s paper identifies two patterns of agentic behavior, “root” (or rational-comprehensive) and “branch” (or successive limited comparisons), and argues that in complicated messy circumstances requiring coordinated action at scale, the way actually effective humans operate is the branch method, which looks like “muddling through” but gradually gets there, where the root ["godding through"] method fails entirely.
For home projects, I wish I could have GPT-5 plugged into Claude Code's CLI interface. Iteration just works! Looking forward to less babysitting in the future!
I haven't tried Codex CLI recently; I think it just got an update. That would be another one to investigate.
I've been meaning to give avante.nvim[2] a try since it aims to provide a "Cursor like" experience, but for now I've been alternating between Code Companion for simple prompts and Claude CLI (in a tmux pane next to Neovim) for agentic stuff.
[0] https://codecompanion.olimorris.dev/
At the moment it feels like most people "reviewing" models are led by their beliefs and agendas, and there are no objective ways to evaluate and compare models (many benchmarks can be gamed).
The blurring of the boundaries between technical overviews, news, opinion, and marketing is truly concerning.
Can't help but laugh at this. It's like you just discovered skepticism and how the world actually works.
Just pick something and use it. AI models are interchangeable. It's not as big a decision as buying a car or even a durian.
I heavily discount same day commentary, there's a quid pro quo on early access vs favorable reviews (and yes, folks publishing early commentary aren't explicitly agreeing to write favorable things, but there's obvious bias baked in).
I don't think it's all particularly concerning; you can discount reviews that are coming out so quickly that it's unlikely the reviewer has really used it very much.
I think you’ll always have some disagreement generally in life, but especially for things like this. Code has a level of subjectivity. Good variable names, correct amount of abstraction, verbosity, over complexity, etc are at least partially opinions. That makes benchmarking something subjective tough. Furthermore, LLMs aren’t deterministic, and sometimes you just get a bad seed in the RNG.
Not only that, but the harness and prompt used to guide the model make a difference. Claude responds to the word “ultrathink”, but if GPT-5 uses “think harder”, then what should be in the prompt?
Anecdotally, I’ve had the best luck with agentic coding when using Claude Code with Sonnet. Better than Sonnet with other tools, and better than Claude Code with other models. But I mostly use Go and Dart and I aggressively manage the context. I’ve found GPTs can’t write Zig at all, but Gemini can, and they can both write Python excellently. All that said, if I didn’t like an answer I’d prompt again, but if I liked the answer I never tried again with a different model to see if I’d like it even more. So it’s hard to know what could’ve been.
I’ve used a ton of models and harnesses. Cursor is good too, and I’ve been impressed with more models in cursor. I don’t get the hype of Qwen though because I’ve found it makes lots of small(er) changes in a loop, and that’s noisy and expensive. Gemini is also very smart but worse at following my instructions, but I never took the time to experiment with prompting.
You are not going to get the same output from GPT5 or Sonnet every time.
And this obviously compounds across many different steps.
E.g. give GPT5 the code for a feature (by pointing at some files and tests) and tell it to review it, find improvement opportunities, and write them down: depending on the size of the code, etc., the answers will be slightly different.
I often do it in Cursor by having multiple agents review a PR, and each of them:
- has to write down their pr-number-review-model.md (e.g. pr-15-review-sonnet4.md)
- has to review the reviews of the other files
Then I review it myself and try to decide what's valuable in there and what's not. And to my own disappointment:
- they often do point to valid flaws I wouldn't have thought about
- they miss the "end-to-end" or general view of how the code fits into a program/process/business. What I mean is: sometimes the real feedback would be that we don't need the change at all. But you need to have those conversations with the AI earlier.
This makes logical sense: you don't want a model to get creative if you need functioning code, but if you want a story idea it should basically be all hallucination.
I think it makes sense to have different models for these tasks.
If Sonnet is more expensive AND more chatty/requires more attempts for the same result, it seems like that would favor GPT5 as the daily driver.
"Agenticness" depends so much on the specific tooling (harness) and system prompts. It mentions Copilot - did it use this for both? Given it's created by Microsoft there's good reason to believe it'd be built yo do especially well with GPT (they'll have had 5 available in preview for months by now). Or it could be the opposite and be tuned towards Sonnet. At the very minimum you'd need to try a few different harnesses, preferably ones not closely related to either OpenAI/MS or Anthropic.
This article even mentions things like "Sonnet is much faster" which is very dependent on the specific load at the time of usage. Today everyone is testing GPT-5 so it's slow and Sonnet is much faster.
Also regarding "Sonnet is faster" I did explicitly mention that I believe this is because GPT-5 is in preview and hours from the release. The speed I experienced doesn't say anything about the model performance you can expect.
Everyone wants to know the answer to GPT5 vs Claude without wasting the tokens personally because we can all more or less guess what the result will be.
Well - I would have been interested in GPT-5 vs. Opus. Claude Code Max is affordable with Opus.
Because Anthropic is presumably massively subsidizing the usage.
The training and research are very expensive. The fixed-price subscriptions are 100% a sweetheart deal.