Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry-picked.
What eval is tracking that? It seems like it's potentially the most important metric for real-world software engineering, and not one-shot vibe prayers.
Don't have long-running tasks, LLMs or not. Break the problem down into small, manageable chunks and then assemble it. Neither humans nor LLMs are good at long-running tasks.
If LLMs are going to act as agents, they need to maintain context across these chunks.
That's a wild comparison to make. I can easily work for an hour. Cursor can hardly work for a continuous pomodoro. "Long-running" is not a fixed size.
LLMs multiply errors over time.
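A toy illustration of why (made-up numbers, assuming independent per-step errors): small per-step failure rates compound quickly over a long agentic run.

```python
# Illustrative only: a hypothetical agent that gets each step right 98% of the
# time, with independent errors, finishes a 50-step task correctly ~36% of the time.
p_step = 0.98
steps = 50
print(f"{p_step ** steps:.2f}")  # ~0.36
```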
Claude always misunderstands how the API exported by my service works, and after every compaction it forgets it all over again and commits "oh, the API has changed since the last time I used it, let me use different query parameters". My brother in Christ, nothing has changed, and you are the one who made this API.
To get great results, it's still very important to manage context well. It doesn't matter if the model allows a very large context window; you can't just throw in the kitchen sink and expect good results.
If there's no substantial difference in software development expertise, then GPT-5 absolutely blows Opus out of the water due to being almost 10x cheaper.
>"GPT‑5 is the strongest coding model we’ve ever released. It outperforms o3 across coding benchmarks and real-world use cases, and has been fine-tuned to shine in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha testers, setting records on many of their private internal evals."
For my use cases, this mostly means needing to really home in on relevant code files, issues, discussions, and PRs. I'm hopeful that GPT-5 will be a step forward in this regard that isn't fully captured in the benchmark results. It's certainly promising that it can achieve similar results more cheaply than e.g. Opus.
https://charlielabs.ai/research/gpt-5
Often, our tasks take 30-45 minutes and can handle massive context threads in Linear or Github without getting tripped up by things like changes in direction part of the way through the thread.
While 10 issues isn't crazy comprehensive, we found it to be directionally very impressive and we'll likely build upon it to better understand performance going forward.
Anecdotally, the tool updates in the latest Cursor (1.4) seem to have made tool usage in models like Gemini much more reliable. Previously it would struggle to make simple file edits, but now the edits work pretty much every time.
I find that OpenAI's reasoning models write better code and are better at raw problem solving, but Claude Code is a much more useful product, even if the model itself is weaker.
I think the dev workflow is going to fundamentally change. To maximise productivity out of this you need multiple AIs working in parallel, so rather than jumping straight into coding, we're going to end up writing a bunch of tickets in a PM tool (Linear[3] looks like it's winning the race atm), then working out (or using the AI to work out) which ones can be run in parallel without causing merge conflicts, then pulling multiple tickets into your IDE/terminal, cycling through the tabs and jumping in as needed.
Atm I'm still not really doing this, but I know I need to make the switch, and I'm thinking that Warp[4] might be best suited for this kind of workflow, with the occasional switch over to an IDE when you need to jump in and make some edits.
Oh also, to achieve this you need to use git worktrees[5,6,7].
[1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k
[3]: https://linear.app/
[5]: https://docs.anthropic.com/en/docs/claude-code/common-workfl...
Spend 1.5 hours now to learn from an experienced dev on a stack that is better suited for the job: most likely future hours gained.
EDIT: It's out now
The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling step to the next, according to a given grammar. E.g. if you're using a JSON grammar and you've just generated a `{`, you can offer the model only the tokens that are valid JSON immediately after a `{`, etc.
[0] https://github.com/dottxt-ai/outlines [1] https://github.com/guidance-ai/guidance
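A minimal sketch of that idea (assuming a Hugging Face-style causal LM; the helper name and the `allowed_token_ids` input are hypothetical, and real libraries like outlines/guidance handle the grammar state and tokenizer alignment for you):

```python
import torch

def constrained_step(model, input_ids, allowed_token_ids):
    """One decoding step: sample only from the tokens the grammar currently allows."""
    logits = model(input_ids).logits[:, -1, :]       # next-token logits over the full vocab
    mask = torch.full_like(logits, float("-inf"))
    mask[:, allowed_token_ids] = 0.0                 # unmask only grammar-valid tokens
    probs = torch.softmax(logits + mask, dim=-1)     # everything else gets probability 0
    return torch.multinomial(probs, num_samples=1)   # e.g. after "{", allow only '"' or "}"
```

The grammar engine's job is to recompute `allowed_token_ids` after every emitted token.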
To achieve AGI, we will need to be capable of high fidelity whole brain simulations that model the brain's entire physical, chemical, and biological behavior. We won't have that kind of computational power until quantum computers are mature.
That's what I hear when people say stuff like this anyway.
Similar to CS folks throwing around physics 'theories'
Maybe your point is that until we understand our own intelligence, which would be reflected in such a simulation, it would be difficult to improve upon it.
Both of those seem questionable, multiplying them together seems highly unlikely.
Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens)
Output: $10 / 1M tokens
For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M for output tokens.
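For a rough feel of the gap, here is a back-of-the-envelope comparison using those listed prices on a made-up request size (100k input + 10k output tokens, ignoring caching):

```python
# Illustrative cost comparison at the listed per-1M-token prices.
def cost(input_tok, output_tok, in_price, out_price):
    return input_tok / 1e6 * in_price + output_tok / 1e6 * out_price

gpt5 = cost(100_000, 10_000, 1.25, 10)   # ~$0.225
opus = cost(100_000, 10_000, 15, 75)     # ~$2.25
print(gpt5, opus, opus / gpt5)           # Opus is roughly 10x the cost here
```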
The big question remains: how well does it handle tools? (i.e. compared to Claude Code)
Initial demos look good, but it performs worse than o3 on Tau2-bench airline, so the jury is still out.
It's interesting that they're using flat token pricing for a "model" that is explicitly made of (at least) two underlying models, one with much lower compute costs than the other, and with users able to at least influence (via prompt), if not choose, which model is being used. I have to assume the pricing is based on a predicted split of how often each underlying model gets used; I wonder if that will hold up, whether users will instead try to rouse the better model into action more than expected, or whether the pricing is so padded that it doesn't matter.
what do you mean?
So, at least twice the context of those.
[^1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a...
I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front.
This run-on sentence swerved at the end; I really can't tell what your point is. Could you reword it for clarity?
I’m not sure of the utility of being so outraged that some people made wrong predictions.
I used gpt-5-mini with reasoning_effort="minimal", and that model finally resisted a hallucination that every other model generated.
Screenshot in post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...
I'll run formal evaluations next.
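For reference, a minimal sketch of how that setting is passed (assuming the OpenAI Python SDK's Chat Completions interface; the prompt content is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-5-mini",
    reasoning_effort="minimal",  # the setting referenced above
    messages=[{"role": "user", "content": "Summarize the release notes."}],
)
print(response.choices[0].message.content)
```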
The new training rewards that suppress hallucinations and tool-skipping hopefully push us in the right direction.
GPT4: Collaborating with engineering, sales, marketing, finance, external partners, suppliers and customers to ensure …… etc
GPT5: I don't know.
Upon speaking these words, AI was enlightened.
A human attempted to solve it before, yet it was never merged... With all the great coding models OpenAI has access to, their SDK team still feels too small for what's needed.
Looks like they're trying to lock us into using the Responses API for all the good stuff.
https://x.com/elonmusk/status/1953509998233104649
Anyone know why he said that?
That's really interesting to me. Looking forward to trying GPT-5!
why isn't it on https://aider.chat/docs/leaderboards/?
"last updated August 07, 2025"
hmm, they should call it gpt-5-chat-nonreasoning or something.
https://extraakt.com/extraakts/openai-s-gpt-5-performance-co...
I almost exclusively wrote and released https://github.com/andrewmcwattersandco/git-fetch-file yesterday with GPT-4o and Claude Sonnet 4, and the latter's agentic behavior was quite nice. I barely had to guide it, and was able to quickly verify its output.