Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry-picked.
What eval is tracking that? It seems like it's potentially the most important metric for real-world software engineering, as opposed to one-shot vibe prayers.
Don't have long-running tasks, LLMs or not. Break the problem down into small, manageable chunks and then assemble it. Neither humans nor LLMs are good at long-running tasks.
If LLMs are going to act as agents, they need to maintain context across these chunks.
That's a wild comparison to make. I can easily work for an hour. Cursor can hardly work for a continuous pomodoro. "Long-running" is not a fixed size.
LLMs multiply errors over time.
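A rough illustration of that compounding (toy numbers, not a measured error model): if each step succeeds independently with probability p, a chain of n steps succeeds with probability p^n, so even small per-step error rates wreck long chains.

```python
# Toy model: independent per-step success probability p compounds over n steps.
# With 99% per-step reliability, a 100-step agent run succeeds ~37% of the time.
for p in (0.99, 0.95, 0.90):
    print(f"p={p}: success over 100 steps ≈ {p**100:.3f}")
```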
To get great results, it's still very important to manage context well. It doesn't matter if the model allows a very large context window; you can't just throw in the kitchen sink and expect good results.
If there's no substantial difference in software development expertise then GPT-5 absolutely blows Opus out of the water due to being almost 10x cheaper.
>"GPT‑5 is the strongest coding model we’ve ever released. It outperforms o3 across coding benchmarks and real-world use cases, and has been fine-tuned to shine in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha testers, setting records on many of their private internal evals."
Anecdotally, the tool updates in the latest Cursor (1.4) seem to have made tool usage in models like Gemini much more reliable. Previously it would struggle to make simple file edits, but now the edits work pretty much every time.
I find that OpenAI's reasoning models write better code and are better at raw problem solving, but Claude code is a much more useful product, even if the model itself is weaker.
EDIT: It's out now
The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling step to the next, according to a given grammar. E.g. if you're using a JSON grammar, and you've just generated a `{`, you can provide the model a choice of only which tokens are valid JSON immediately after a `{`, etc.
[0] https://github.com/dottxt-ai/outlines [1] https://github.com/guidance-ai/guidance
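For anyone curious, here's a toy sketch of the idea with a made-up mini-grammar and vocab (not the actual outlines/guidance API — those compile real grammars into token masks):

```python
# Constrained decoding sketch: at each step, mask the logits so only
# grammar-approved tokens can be selected.
VOCAB = ["{", "}", '"key"', ":", '"value"', ",", "hello"]

def allowed_after(prev):
    # Toy JSON-ish "grammar": after "{" only a quoted key or "}" is valid, etc.
    table = {
        None: {"{"},
        "{": {'"key"', "}"},
        '"key"': {":"},
        ":": {'"value"'},
        '"value"': {",", "}"},
        ",": {'"key"'},
    }
    return table.get(prev, set())

def constrained_argmax(logits, prev):
    # Keep only tokens the grammar allows, then pick the highest-scoring one.
    allowed = allowed_after(prev)
    masked = [(tok, score) for tok, score in zip(VOCAB, logits) if tok in allowed]
    return max(masked, key=lambda ts: ts[1])[0]

# The model "wants" to say "hello" (score 5.0), but after "{" the grammar
# only permits '"key"' or "}", so "hello" can never be sampled.
logits = [0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 5.0]
print(constrained_argmax(logits, "{"))  # → "key"
```

Real implementations do this against the full tokenizer vocab, which is why they need efficient grammar-to-token-mask compilation.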
To achieve AGI, we will need to be capable of high fidelity whole brain simulations that model the brain's entire physical, chemical, and biological behavior. We won't have that kind of computational power until quantum computers are mature.
Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens)
Output: $10 / 1M tokens
For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M for output tokens.
The big question remains: how well does it handle tools? (i.e. compared to Claude Code)
Initial demos look good, but it performs worse than o3 on Tau2-bench airline, so the jury is still out.
It's interesting that they're using flat token pricing for a "model" that is explicitly made of (at least) two underlying models, one with much lower compute costs than the other, and with the user able to at least influence (via prompt), if not choose, which model is being used. I have to assume this pricing is based on a predicted split between how often the underlying models get used; I wonder whether that will hold up, whether users will instead try to rouse the better model into action more often than expected, or whether the pricing is so padded that it doesn't matter.
what do you mean?
So, at least twice the context of those
[^1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a...
I'm already running into a bunch of issues with the structured output APIs from other companies like Google; OpenAI has been doing a great job on this front.
I used gpt-5-mini with reasoning_effort="minimal", and that model finally resisted a hallucination that every other model generated.
Screenshot in post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...
I'll run formal evaluations next.
The new training rewards that suppress hallucinations and tool-skipping will hopefully push us in the right direction.
GPT4: Collaborating with engineering, sales, marketing, finance, external partners, suppliers and customers to ensure …… etc
GPT5: I don't know.
Upon speaking these words, AI was enlightened.
You can't make this up
A human attempted to solve it before, yet it was never merged... With all the great coding models OpenAI has access to, their SDK team still feels too small for the workload.
Looks like they're trying to lock us into using the Responses API for all the good stuff.
I wrote and released https://github.com/andrewmcwattersandco/git-fetch-file yesterday almost exclusively with GPT-4o and Claude Sonnet 4, and the latter's agentic behavior was quite nice. I barely had to guide it, and was able to quickly verify its output.