> The specifics are unknown, but they might...
Hold up.
> but some assume that they do it this way.
Come on now.
https://x.com/michpokrass/status/1869102222598152627
It says:
> hey aidan, not a miscommunication, they are different products! o1 pro is a different implementation and not just o1 with high reasoning.
Sounds like it's just o3 with a higher thinking budget to me.
So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner, better-organized code and answers.
I feel like the benchmarks aren't really doing a good job at capturing/reflecting capabilities atm. eg, while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.
Aside from especially complex/gnarly problems, I feel like a lot of the different models are all good enough and it comes down to reliability. For example, I've basically stopped using Claude for work because multiple times now it's completely eaten my prompts and even artifacts it's generated. It also hits limits ridiculously fast (and requests that fail on network/resource errors still count against them).
I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others and so far I haven't caught 4.1 transposing/having errors with numbers (which I've noticed w/ 4o and Sonnet).
Having tested most of the leading edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls (particularly with translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (eg, of actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.
(I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbh, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)
For the past year or two, I've had my own personal 25-question vibe check I've used to kick the tires on new models, but I think the future is something both a little more rigorous and a little more automated (something like an LLM jury with UltraFeedback-style criteria derived from your own real-world exchanges, then BTL-ranked)? A future project...
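The ranking half of that is easy to prototype, for what it's worth. Here's a minimal Python sketch of fitting Bradley-Terry scores to pairwise "which answer was better" votes from a jury; the model names and vote counts below are placeholders, not real data.

```python
from collections import defaultdict

def bradley_terry(pairwise_wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts via the
    standard MM (minorize-maximize) updates.

    pairwise_wins[(a, b)] = number of times the jury preferred a over b.
    Returns {model: strength}, normalized to sum to 1 (higher is better).
    """
    models = {m for pair in pairwise_wins for m in pair}
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # total comparisons per unordered pair
    for (a, b), n in pairwise_wins.items():
        wins[a] += n
        games[frozenset((a, b))] += n

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and games[frozenset((i, j))] > 0
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

# Hypothetical jury verdicts over my own prompts (made-up numbers):
votes = {
    ("gpt-4.5", "opus-4"): 14, ("opus-4", "gpt-4.5"): 6,
    ("gpt-4.5", "o3"): 11,     ("o3", "gpt-4.5"): 9,
    ("o3", "opus-4"): 12,      ("opus-4", "o3"): 8,
}
print(sorted(bradley_terry(votes).items(), key=lambda kv: -kv[1]))
```

The jury/criteria side (collecting those votes against your own real-world exchanges) is the part that would take actual work.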
At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.
Also, does anybody know what limits o3-pro has under the team plan? I don't see it available in the model picker at all (on team).
OpenAI dropped the price of o3 by 80%
sama's highlight[0]:
> "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future."
I kept nudging the team to go the whole way to just let o3 be their CEO but they didn't bite yet haha
Dangerous incentives IMO.
we are definitely not seeking to be openai sycophants, nor would they want us to be.
This announcement adds o3-pro, which pairs with o3 in the same way the o4 models go together.
It should be called o3-high, but to align with the $200 pro membership it’s called pro instead.
That said, o3 is already an incredibly powerful model. I prefer it over the new Anthropic 4 models and Gemini 2.5. Its raw power seems similar to those others, but it's so good at inline tool use that it usually comes out ahead overall.
Any non-trivial code generation/editing should be using an advanced reasoning model, or else you’re losing time fixing more glitches or missing out on better quality solutions.
Of course the caveat is cost, but there’s value on the frontier.
o4-mini-high is the label on chatgpt.com for what in the API is called o4-mini with reasoning={"effort": "high"}. Whereas o4-mini on chatgpt.com is the same thing as reasoning={"effort": "medium"} in the API.
o3 can also be run via the API with reasoning={"effort": "high"}.
o3-pro is different than o3 with high reasoning. It has a separate endpoint, and it runs for much longer.
See https://platform.openai.com/docs/guides/reasoning?api-mode=r...
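For anyone who hasn't used it, this is roughly what that looks like with the Python SDK's Responses API (a sketch based on the docs linked above; treat the prompt as a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# o3 with reasoning effort turned up: the API-side equivalent of the
# "high" variants in the chatgpt.com model picker.
response = client.responses.create(
    model="o3",
    reasoning={"effort": "high"},
    input="Review this function for race conditions: ...",  # placeholder prompt
)
print(response.output_text)

# o3-pro is a separate model (and a much longer wait), not just o3 with
# effort="high"; per the comment above it lives behind its own endpoint:
# client.responses.create(model="o3-pro", input="...")
```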
Of course by now it'll be in-distribution. Time for a new benchmark...
E.g., the pelicans all look pretty cruddy including this one, but the fact that they are being delivered in .SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.
The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?
And in ChatGPT Pro.
There are quite a few on Google Image search.
On the other hand they still seem to struggle!
It would be interesting if there was a model that was specifically trained on task-oriented data. It's my understanding they're trained on all data available, but I wonder if it can be fine-tuned or given some kind of reinforcement learning on breaking down general tasks to specific implementations. Essentially an agent-specific model.
Nothing fancy. Visual Studio Code + Copilot, agent mode, a couple prompt files, and that's it.
Performance can improve linearly and utility can be massively jumpy. For some people/tasks performance can have improved but it'll have been "interesting but pointless" until it hits some threshold and then suddenly you can do things with it.
There were always going to be diminishing returns on these benchmarks; it's by construction. A score capped at 100% has to show smaller and smaller gains as it approaches the ceiling, so it's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.
Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.
If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.
"ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task
ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task
Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"
Given the models don't even see the versions we get to see, it doesn't surprise me they have issues with these. It's not hard to make benchmarks so hard that neither humans nor LLMs can do them.
I don't know about that. I think it's mainly because nowadays LLMs can output very inconsistent results. In some applications they can generate surprisingly good code, but during the same session they can also misstep and shit the bed while following a prompt for small changes. For example, sometimes I still get responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new method" and in response the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?
Even though it's a large 10% increase first, then only a 0.999% increase.
From 90% to 99% is a 10x reduction in error rate, but 99% to 99.999% is a 1000x decrease in error rates.
90% -> 1 error per 10
99% -> 1 error per 100
99.99% -> 1 error per 10,000
That framing helps show the growth in accuracy once the numbers start getting small (and it's why clock accuracy is framed as 1 second lost per…).
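If it helps, the arithmetic is just a ratio of error rates; a quick Python sketch using the numbers above:

```python
def error_rate_reduction(acc_from: float, acc_to: float) -> float:
    """Factor by which the error rate shrinks when accuracy improves."""
    return (1 - acc_from) / (1 - acc_to)

print(error_rate_reduction(0.90, 0.99))     # ~10:   1-in-10 errors  -> 1-in-100
print(error_rate_reduction(0.99, 0.9999))   # ~100:  1-in-100 errors -> 1-in-10,000
print(error_rate_reduction(0.99, 0.99999))  # ~1000: the 99% -> 99.999% jump above
```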
https://www.svgviewer.dev/s/c3j6TEAP
in case anyone is interested
Have completed around a dozen chats with o3-pro so far. Can't say I'm impressed, output feels qualitatively very similar to regular o3.
Tried feeding in loads of context as suggested in the article but generally feels like a miss.
I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on AI research and sometimes practical applications for my life since then.
I can easily afford $200 for a Pro account but I get this nagging feeling that LLMs are not the final path to the powerful AI I have always dreamed of and I don't want to support this level of hype.
I have lived through a few AI winters and I worry that accountants will tally up the costs, environmental and money, versus the benefits and that we collectively have an 'oh shit' moment.
It created the image showing each month but when you looked at each month it was so janky ... February 31st and other huge errors!
I'm not using image creation to make 3D art for fun or art's sake; I'm trying to use it to create utility images to share for discussion with friends & co-workers. The above is just one of many ways it fails when creating utility images!
Osyris•1d ago
However, the "plus" plan absolutely could use some trimming.
CamperBob2•1d ago
Sounds like o3-pro is even slower, which is fine as long as it's better.
o4-mini-high is my usual go-to model if I need something better than the default GPT4-du jour. I don't see much point in the others and don't understand why they remain available. If o3-pro really is consistently better, it will move o1-pro into that category for me.
simonw•1d ago
> how about we fix our model naming by this summer and everyone gets a few more months to make fun of us (which we very much deserve) until then?
transcriptase•1d ago
GPT-4o
o3
o4-mini
o4-mini-high
GPT-4.5
GPT-4.1
GPT-4.1-mini
aetherspawn•1d ago
For example, the other day they released a supposedly better model with a lower number...
paxys•1d ago
Those users go to chat.openai.com (or download the app), type text in the box and click send.
AtlasBarfed•1d ago
Port unix sed from C to Java with a full test suite and all options supported.
Somewhere between "it answers questions of life" and "it beats PhDs at math questions", I'd like to see one LLM take this (IMO rather "pure") language task and succeed.
It is complicated, but it isn't complex. It's string operations with a deep but not that deep expression system and flag set.
It is well-described and documented on the internet, and presumably training sets. It is succinctly described as a problem that virtually all computer coders would understand what it entailed if it were assigned to them. It is drudgerous, showing the opportunity for LLMs to show how they would improve true productivity.
GPT fails to do anything other than the most basic substitute operations. Claude was only slightly better, but to its detriment hallucinated massive amounts and made fake passing test cases that didn't even test the code.
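For scale, the "most basic substitute operation" is the part that fits in a dozen lines; the real work is everything else (addresses, hold space, the rest of the flag set). A toy Python illustration of just that piece, not a claim about how the port should look:

```python
import re

def apply_substitute(command: str, line: str) -> str:
    """Apply a sed-style 's/pattern/replacement/flags' command to one line.
    Toy version: fixed '/' delimiter, only the 'g' and 'i' flags."""
    _, pattern, replacement, flags = command.split("/", 3)
    count = 0 if "g" in flags else 1          # g -> replace all occurrences
    re_flags = re.IGNORECASE if "i" in flags else 0
    # sed uses \1 backreferences in the replacement; re.sub accepts that form
    return re.sub(pattern, replacement, line, count=count, flags=re_flags)

print(apply_substitute("s/cat/dog/g", "cat sat on the cat mat"))
# -> "dog sat on the dog mat"
```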
The reaction I get to this test is ambivalence, but IMO if LLMs could help port entire software packages between languages with similar feature sets (aside from Turing Completeness), then software cross-use would explode, and maybe we could port "vulnerable" code to "safe" Rust en masse.
I get it, it's not what they are chasing customer-wise. They want to write (in n-gate terms) webcrap.
nipah•8h ago
The normal o3 also managed to break three isolated installations of Linux I was trying it with a few days ago. The task was very simple: set up Ubuntu with btrfs, Timeshift, and grub-btrfs. It managed to fail every single time (even when searching the web), so it was not impressive either.
resters•1d ago
I think the naming scheme is just fine and is very straightforward to anyone who pays the slightest bit of attention.