> We’ll post to @openaidevs once the new pricing is in full effect. In $10… 9… 8…
There is also speculation that they are only dropping the input price, not the output price (which includes the reasoning tokens).
Input: $2.00 / 1M tokens
Cached input: $0.50 / 1M tokens
Output: $8.00 / 1M tokens
https://openai.com/api/pricing/
Now cheaper than gpt-4o and same price as gpt-4.1 (!).
The speculation of only input pricing being lowered was because yesterday they gave out vouchers for 1M free input tokens while output tokens were still billed.
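For anyone pricing out a workload under the new numbers, a quick back-of-the-envelope sketch (rates hard-coded from the pricing above; the token counts are made-up example values):

```python
# Rough cost estimate for an o3 request under the new pricing (USD per 1M tokens).
INPUT_PER_M = 2.00
CACHED_INPUT_PER_M = 0.50
OUTPUT_PER_M = 8.00  # output includes reasoning tokens

def o3_cost(input_tokens, cached_input_tokens, output_tokens):
    return (input_tokens * INPUT_PER_M
            + cached_input_tokens * CACHED_INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: 20k fresh prompt tokens, 80k cached, 10k output (incl. reasoning)
print(f"${o3_cost(20_000, 80_000, 10_000):.4f}")  # ~$0.16 per request
```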
This is where the naming choices get confusing. "Should" o3 cost more or less than GPT-4.1? Which is more capable? A generation 3 of tech intuitively feels less advanced than a 4.1 of a (similar) tech.
- o4 is reasoning
- 4o is not
They simply do not do a good job of differentiating. Unless you work directly in the field, it is likely not obvious what is the difference between "our most powerful reasoning model" and "our flagship model for complex tasks."
"Does my complex task need reasoning or not?" seems to be how one would choose. (What type of task is complex but does not require any reasoning?) This seems less than ideal!
Also no idea why he thinks Roo is handicapped when Claude Code nerfs the thinking output and requires typing "think"/"think hard"/"think harder"/"ultrathink" just to expand the max thinking tokens... which on "ultrathink" only sets it to 32k, when the max in Roo is 51200 and it's just a setting.
From my experience (so not an ultimate truth), Claude is not great at deciding to plan on its own: it dives immediately into coding.
If you ask it to think step-by-step, it still doesn't do it. Gemini 2.5 Pro is good at that planning, but terrible at actual coding.
So you can use Gemini as planner and Claude as programmer and you get something decent on RooCode.
This “think wisely” that you have to repeat 10x in the prompt is absolutely true
Edit: I think I know where our miscommunication is happening...
The "think"/"ultrathink" series of magic words are a claudecode specific feature used to control the max thinking tokens in the request. For example, in claude code, saying "ultrathink" sets the max thinking tokens to 32k.
On other clients these keywords do nothing. In Roo, max thinking tokens is a setting. You can just set it to 32k, and then that's the same as saying "ultrathink" in every prompt in claudecode. But in Roo, I can also set up different settings profiles to use for each mode (with different max thinking token settings), configure the mode prompt, system prompt, etc. No magic keywords needed... and you have full control over the request.
Claude Code doesn't expose that level of control.
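For reference, the "setting" being discussed corresponds to the thinking budget on Anthropic's Messages API. A minimal sketch, assuming an API key in the environment (the model id and budget values are illustrative, not a claim about what Roo or Claude Code actually send):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Extended thinking with an explicit token budget; this is roughly what
# "ultrathink" (32k) or Roo's 51200 setting would translate to per request.
resp = client.messages.create(
    model="claude-sonnet-4-20250514",   # illustrative model id
    max_tokens=64_000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 51_200},
    messages=[{"role": "user", "content": "Plan the refactor before writing code."}],
)
print(resp.content)
```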
I have a suspicion that's how they were able to get gpt-4-turbo so fast. In practice, I found it inferior to the original GPT-4 but the company probably benchmaxxed the hell out of the turbo and 4o versions so even though they were worse models, users found them more pleasing.
a proxy to that may be the anecdotal evidence of users who report back in a month that model X has gotten dumber (started with gpt-4 and keeps happening, esp. with Anthro and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.
Some users. For me the drop was so huge it became almost unusable for the things I had used it for.
You pay a monthly fee, but Gemini is completely jammed for 5-6 hours while North America is working.
Google is best in pure AI research, both in quality and volume. They have sucked at productization for years. Not just AI but other products as well. A real mystery.
Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).
This time around it felt pretty stark. I used ChatGPT to create at most 20 different image compositions. And after a couple of good ones at first, it felt worse after. One thing I've noticed recently is that when working on vector art compositions, the results start more simplistic, and often enough look like clipart thrown together. This wasn't my experience first time around. Might be temperature tweaks, or changes in their prompt that lead to this effect. Might be some random seed data they use, who knows.
Anecdotally, it's quite clear that some models are throttled during the day (eg Claude sometimes falls back to "concise mode" - with and without a warning on the app).
You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).
Finally, there are cases where it was confirmed by the company, like GPT-4o's sycophantic tirade that very clearly impacted its output (https://openai.com/index/sycophancy-in-gpt-4o/)
Trusting these LLM providers today is as risky as trusting Facebook as a platform, when they were pushing their “opensocial” stuff
You've also made the mistake of conflating what's served via API platforms which are meant to be stable, and frontends which have no stability guarantees, and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle was only on the specific model that's served via the ChatGPT frontend and never impacted the stable snapshots on the API.
I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.
> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.
openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of the model, and not the _same_ model over time like how people like to complain that a model gets "nerfed" over time.
When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted and then just expect it to be able to handle ever more complex queries and get disappointed when I hit a new limit.
Which is why the base model wouldn't necessarily show differences when you benchmarked them.
There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.
Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
Hmm, that's evidently and anecdotally wrong:
interesting take, I wouldn't be surprised if they did that.
If you could do this automatically, it would be a game changer: you could run the top 5 models in parallel and select the best answer every time.
But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them.
Remember, they have access to the RLHF reward model, against which they can evaluate all N outputs and pick the most "rewarded" answer to send.
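The selection step itself is trivial once you have a reward model; in this sketch `generate` and `score` are hypothetical placeholders standing in for a lab's actual sampling and reward infrastructure:

```python
def best_of_n(prompt, generate, score, n=256):
    """Sample n candidate answers and return the one the reward model likes most.

    generate(prompt) -> str and score(prompt, answer) -> float are placeholders;
    a lab would batch these calls, but the selection logic is the same.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```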
o3 is still o3 (no nerfing) and o3-pro is new and better than o3.
If we were lying about this, it would be really easy to catch us - just run evals.
(I work at OpenAI.)
If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].
[1] `chatgpt-4o-latest` being an explicit exception
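In API terms, the distinction looks roughly like this (a sketch using the Python SDK; the dated o3 snapshot id is illustrative, while `chatgpt-4o-latest` is the documented exception mentioned above):

```python
from openai import OpenAI

client = OpenAI()

# Pinned, dated snapshot: the contract is that this never changes silently.
pinned = client.chat.completions.create(
    model="o3-2025-04-16",  # illustrative dated snapshot id
    messages=[{"role": "user", "content": "Hello"}],
)

# Floating alias: explicitly documented to be updated without warning.
floating = client.chat.completions.create(
    model="chatgpt-4o-latest",
    messages=[{"role": "user", "content": "Hello"}],
)
```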
It's the same logic of why UB in C/C++ isn't a license to do whatever the compiler wants. We're humans and we operate on implications, common-sense assumptions and trust.
https://cloud.google.com/products?hl=en#product-launch-stage...
"At Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months."
I bet someone at Google would be a bit surprised to see someone jumping to legalese to act like this...novelty...is inherently due to the preview status, and based on anything more than a sense that there's no net harm done to us if it costs the same and is better.
I'm not sure they're wrong.
But it also leads to a sort of "nobody knows how anything works because we have 2^N configs and 5 bits" - for instance, 05-06 was also upgraded to 06-05. Except it wasn't, if you sent variable thinking to 05-06 after upgrade it'd fail. (and don't get me started on the 5 different thinking configurations for Gemini 2.5 flash thinking vs. gemini 05-06 vs. 06-05 and 0 thinking)
It's a preview model - for testing only, not for production. Really not that complicated.
Why are you in the comments section of an engineering news site?
(note: beyond your, excuse me while I'm direct now, boorish know-nothing reply, the terms you are citing have nothing to do with the thing people are actually discussing around you, despite your best efforts. It doesn't say "we might swap in a new service, congrats!", nor does it have anything to say about that. Your legalese at most describes why they'd pull 05-06, not forward 05-06 to 06-05. This is a novel idea.)
And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.
Do you understand that even if it did say that, that wasn't true either? It was some weird undocumentable half-beast?
I have exactly your attitude about their cavalier use of preview for all things Gemini, and even people's use of the preview models.
But I've also been on this site for 15 years and am a bit wow'd by your interlocution style here -- it's quite rare to see someone flip "the 3P provider swapped the service on us!" into "well they said they could turn it off, of course you should expect it to be swapped for the first time ever!" insert dull sneer about the quality of other engineers
I am done with this thread. We are going around in circles.
It's a cute argument, as I noted, I'm emotionally sympathetic to it even, it's my favorite "get off my lawn." However, I've also been on the Internet long enough to know you write back, at length, when people try anti-intellectualism and why-are-we-even-talking-about-this as interaction.
"b. Disclaimer. PRE-GA OFFERINGS ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OR REPRESENTATIONS OF ANY KIND. Pre-GA Offerings (i) may be changed, suspended or discontinued at any time without prior notice to Customer and (ii) are not covered by any SLA or Google indemnity. Except as otherwise expressly indicated in a written notice or Google documentation, (A) Pre-GA Offerings are not covered by TSS, and (B) the Data Location Section above will not apply to Pre-GA Offerings."
It’s always worth considering that this may be your problem. If you still don’t get it, the only valuable reply is one which asks a question. Also, including “it’s not that complicated” only serves to inflame.
If the "preview release" you were using was v0.3, and suddenly it started being v0.6 without warning, that would be insane. The only point of providing a version number is to give people an indicator of consistency. The datestamp is a version number. If they didn't want us to expect consistency, they should not have given it a version number. That's the whole point of rolling release branches, they have no version. You don't have "v2.0" of a rolling release, you just have "latest". They fucked up by giving it a datestamp.
This is an extremely old and well-known problem with software interfaces. Either you version it or you don't. If you do version it, and change it, you change the version, and give people dependent on the old version some time to upgrade. Otherwise it breaks things, and that pisses people off. The alternative is not versioning it, which is a signal that there is no consistency to be expected. Any decent software developer should have known all this.
And while I'm at it: what's with the name flip-flopping? In 2014, GCP issued a press release explaining it was no longer using "Preview", but "Alpha" and "Beta" (https://cloudplatform.googleblog.com/2014/10/new-release-pha...). But the link you showed earlier says "Alpha" and "Beta" are now deprecated. But no press release? I guess that's our bad for not constantly reading the fine print and expecting it to revert back to something from 11 years ago.
Speaking of a new name. I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" for OpenAI.
https://openai.com/index/introducing-o3-and-o4-mini/
o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.
o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.
Then, the new announcement
https://help.openai.com/en/articles/6825453-chatgpt-release-...
o3 scored 90 on AIME 2024 and 81 on GPQA.
o4-mini wasn't measured
---
Codeforces is the same, but they have a footnote that they're using a different dataset due to saturation, but still have no grounding model to compare with
o3 pro is a different thing - it’s not just o3 with maximum remaining effort.
Here's the current state with version numbers as far as I can piece it together (using my best guess at naming of each component of the version identifier. Might be totally wrong tho):
1) prefix (optional): "gpt-", "chatgpt-"
2) family (required): o1, o3, o4, 4o, 3.5, 4, 4.1, 4.5,
3) quality? (optional): "nano", "mini", "pro", "turbo"
4) type (optional): "audio", "search"
5) lifecycle (optional): "preview", "latest"
6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc (I assume the last ones are a date without a year for 2024?)
7) size (optional): "16k"
Some final combinations of these version number components are as small as 1 ("o3") or as large as 6 ("gpt-4o-mini-search-preview-2024-12-17").
Given this mess, I can't blame people assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
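Given the component breakdown above, a rough parser makes the structure (and the mess) explicit. This regex is reverse-engineered guesswork against published model ids, not any official grammar:

```python
import re

# Rough parser for the naming components listed above; best-guess only.
PATTERN = re.compile(
    r"^(?P<prefix>gpt-|chatgpt-)?"
    r"(?P<family>o1|o3|o4|4o|3\.5|4\.5|4\.1|4)"
    r"(?:-(?P<quality>nano|mini|pro|turbo))?"
    r"(?:-(?P<type>audio|search))?"
    r"(?:-(?P<lifecycle>preview|latest))?"
    r"(?:-(?P<date>\d{4}-\d{2}-\d{2}|\d{4}))?"
    r"(?:-(?P<size>\d+k))?$"
)

for name in ["o3", "gpt-4o-mini-search-preview-2024-12-17",
             "gpt-3.5-turbo-16k", "chatgpt-4o-latest"]:
    m = PATTERN.match(name)
    print(name, "->", {k: v for k, v in m.groupdict().items() if v} if m else None)
```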
As an analogy, think of it like this:
o3-low ~ Ford Mustang with the accelerator gently pressed
o3-medium ~ Ford Mustang with the accelerator pressed
o3-high ~ Ford Mustang with the accelerator heavily pressed
o3 pro ~ Ford Mustang GT
Even though a Mustang GT is a different car than a Mustang, you don’t give it a totally different name (eg Palomino). The similarity in name signals it has a lot of the same characteristics but a souped up engine. Same for o3 pro.
Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our original unified naming scheme made room for 999 versions, but we didn't make it past 3.
Edit: When I say the Mustang GT is a different car than a Mustang - I mean it literally. If you bought a Mustang GT and someone delivered a Mustang with a different trim, you wouldn't say "great, this is just what I ordered, with the same features/behavior/value." That we call it a different trim is a linguistic choice to signal to consumers that it's very similar, and built on the same production line, but comes with a different engine or different features. Similar to o3 pro.
> As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.
- o3 pro is based on o3
- o3 pro uses the same underlying model as o3
- o3 pro is similar to o3, but is a distinct thing that's smarter and slower
- o3 pro is not o3 with longer reasoning
In my analogy, o3 pro vs o3 is more than just an input parameter (e.g., not just the accelerator input) but less than a full difference in model (e.g., Ford Mustang vs F150). It's in between, kind of like car trim with the same body but a stronger engine. Imperfect analogy, and I apologize if this doesn't feel like it adds any clarity. At the end of the day, it doesn't really matter how it works - what matters is if people find it worth using.
I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times will be one of the things that can slip through. We are not suggesting you are running a different model but that you are quantizing it so that you can support more people.
This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know which model is running, how large it is, what it is running on, etc. None of that is provided, and there is only one reason I can think of: to be able to reduce resources unnoticed.
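If you wanted to test the "degraded at peak hours" theory rather than speculate, the check is cheap: hit the same pinned model with the same fixed prompts at different times of day and log the results for later scoring. A rough sketch (the prompt set, snapshot id, and file name are all placeholders):

```python
import datetime, json
from openai import OpenAI

client = OpenAI()
PROMPTS = ["Factor 391 into primes.", "Write a regex for ISO dates."]  # your fixed eval set

def snapshot_run(model="o3-2025-04-16"):  # illustrative pinned snapshot id
    results = []
    for p in PROMPTS:
        r = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": p}]
        )
        results.append({"prompt": p, "answer": r.choices[0].message.content})
    return {"ts": datetime.datetime.utcnow().isoformat(), "results": results}

# Append one record per run; score/diff the log offline to see whether quality
# actually moves with time of day, instead of relying on vibes.
with open("degradation_log.jsonl", "a") as f:
    f.write(json.dumps(snapshot_run()) + "\n")
```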
This is HN and not reddit.
"I didn't read the ToS, like everyone else, but my guess..."
Ah, there it is.
However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).
This alongside the news that you guys have decreased the price of o3 by 80% does really make it feel like you've quantized the model or knee-capped thinking or something. If you say it is wholly unchanged I'll believe you, but not sure how else to explain the (admittedly subjective) performance drop I've experienced.
But yes, perhaps the answer is that about a week ago I started asking subconsciously harder questions, and G2.5P handled them better because it had just been improved, while o3 had not so it seemed worse. Or perhaps G2.5P has always had more capacity than o3, and I wasn't asking hard enough questions to notice a difference before.
o4-mini-high, o4-mini, o3, o3-pro, gpt-4o
Oy.
The thing that gets me is it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page and it sometimes takes multiple screenshots of the page to convince it that it's wrong.
It had a few bugs here or there when they pushed updates, but it didn't get worse.
My question is not whether this is true (it is) but why it's happening.
I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use a/b testing on benchmarks to tell them whether training or architectural changes need to be reverted.
But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly so I'm skeptical that they're the state of the art with respect to bench-marking Gemini.
For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.
I primarily use aider + 2.5 pro for planning/spec files, and occasionally have it do file edits directly. Works great, other than stopping it mid-execution once in a while.
IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems to be uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.
Entirely anecdotally, 06-05 seems to exactly ride the line of "good enough to be the best, but no better than that" presumably to save costs versus the OG 03-25.
In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.
The main leaderboard page that you linked to is updated quite frequently, but it doesn't contain multiple benchmarks for the same exact model.
They inflated expectations and then released to the public a model that underperforms
> Today, we dropped the price of OpenAI o3 by 80%, bringing the cost down to $2 / 1M input tokens and $8 / 1M output tokens.
> We optimized our inference stack that serves o3—this is the same exact model, just cheaper.
In the API, we never make silent changes to models, as that would be super annoying to API developers [1]. In ChatGPT, it's a little less clear when we update models because we don't want to bombard regular users with version numbers in the UI, but it's still not totally silent/opaque - we document all model updates in the ChatGPT release notes [2].
[1] chatgpt-4o-latest is an exception; we explicitly update this model pointer without warning.
[2] ChatGPT Release Notes document our updates to gpt-4o and other models: https://help.openai.com/en/articles/6825453-chatgpt-release-...
(I work at OpenAI.)
We don't make the hobbyist mistake of randomly YOLO-ing various "quantization" methods after all training is done and calling it a day, at all. Quantization was done before it went live.
They probably also have cheap code or cheap models that normalize requests to increase cache hit rate.
In this case you didn’t even get the same answer, you only happened to have one sentence in the answer match.
> Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt
That's not true at all and is exactly what prompt caching is for. For one, you can at least populate the attention KV Cache, which will scale with the prompt size. It's true that if your prompt is larger than the context size, then the prompt size no longer affects inference speed since it essentially discards the excess.
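To make the KV-cache point concrete, here is a minimal sketch with HuggingFace transformers (GPT-2 is only a stand-in model): the shared prefix is run through the model once, and the cached keys/values are reused for each continuation. This is essentially what provider-side prompt caching buys you, while every response is still computed fresh, as the quoted explanation says.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = "You are a helpful assistant. Answer concisely.\n\n"  # shared prompt prefix
prefix_ids = tok(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values  # the "prompt cache"

def continue_greedily(suffix, max_new_tokens=20):
    # Deep-copy so the shared prefix cache can be reused across requests.
    past = copy.deepcopy(prefix_cache)
    ids = tok(suffix, return_tensors="pt").input_ids
    out_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decode
            out_tokens.append(next_id.item())
            ids = next_id
    return tok.decode(out_tokens)

# Two different user turns reuse the same cached prefix; the answers themselves
# are still generated anew each time.
print(continue_greedily("User: How do I cook pasta?\nAssistant:"))
print(continue_greedily("User: What is a KV cache?\nAssistant:"))
```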
My mind immediately goes to rowhammer for some reason.
At the very least this opens up the possibility of some targeted denial of service
No? Eg "how to cook pasta" is probably asked a lot.
Once new MacBooks and iPhones have enough memory onboard this is going to be a disaster for OpenAI and other providers.
If I was OpenAI (or Anthropic for that matter) I would remain scared of Google, who is now awake and able to dump Gemini 2.5 pro on the market at costs that I'm not sure people without their own hardware can compete with, and with the infrastructure to handle everyone switching to them tomorrow.
OpenAI vs Anthropic on Google Trends
https://trends.google.com/trends/explore?date=today%203-m&q=...
ChatGPT vs Claude on Google Trends
https://trends.google.com/trends/explore?date=today%203-m&q=...
Do you not think that batch inference gives at least a bit of a moat whereby unit costs fall with more prompts per unit of time, especially if models get more complicated and larger in the future?
I get that the point is to be the last man standing: poach customers by lowering the price, and perhaps attract a few people who wouldn't have bought a subscription at the higher price. I just question how long investors can justify pouring money into OpenAI. OpenAI is also the poster child for modern AI, so if they fail the market will react badly.
Mostly I don't understand Silicon Valley venture capital, but between dumping prices, making wild purchases with investor money, and mostly only leading on branding, why isn't this a sign that OpenAI is failing?
That seems likely to me, all of the LLM providers have been consistently finding new optimizations for the past couple of years.
Right now the new Gemini surpassed their o3 (barely) in benchmarks for significantly less money, so they cut pricing to stay competitive.
I bet they haven't released o4 not because it's not competitive, but because they are playing the Nvidia game: release a new product that is just enough better to convince people to buy it. So IMO they are holding the full o4 model back to have something to release after the competition releases something better than their top horse.
Many, many companies would be thrilled to keep paying current model prices even with no performance improvement for the next 2-3 years
We still have so many features to build on top of current capabilities
They need lots of energy and customers don’t pay much, if they pay at all
The developers of AI models do have a moat, the cost of training the model in the first place.
It's 90% of the low effort AI wrappers with little to no value add who have no moat.
Or is the price drop an attempt to cover up bad news about the outage with news about the price drop?
This makes no sense. No way a global outage will get less coverage than the price drop.
Also the earliest sign of price drop is this tweet 20 hrs ago (https://x.com/OpenAIDevs/status/1932248668469445002), which is earlier than the earliest outage reports 13hrs ago on https://downdetector.com/status/openai/
Have you seen today's outage on any news outlet? I have not. Is there an HN thread?
That said, I'm absolutely willing to hear people out on "value-adds" I am missing out on; I'm not a knee-jerk hater (For context, I work with large, complex & private databases/platforms, so its not really possible for me to do anything but ask for scripting suggestions).
Also, I am 100% expecting a sad day when I'll be forced to subscribe, unless I want to read dick pill ads shoehorned in to the answers (looking at you, YouTube). I do worry about getting dependent on this tool and watching it become enshittified.
Just switch to a competitors free offering. There are enough to cycle through not to be hindered by limits. I wonder how much money I have cost those companies by now?
How anyone believes there is any moat for anyone here is beyond me.
I've never used anything like it. I think new Claude is similarly capable
I think if your goal is to have properly written language using older writing styles, then you're correct.
The fact that these tend to be written in an older writing style is to me incidental. You could rewrite all your college text books in contemporary social media slang and I would still consider them high-quality texts.
In my experience, o4-mini and o4-mini-high are far behind o3 in utility, but since I’m rate-limited for the latter, I end up primarily using the former, which has kind of reinforced the perception that OpenAI’s thinking models are behind the competition altogether.
It's great with those, however!
Just yesterday, they reported an annualized revenue run rate of 10B. Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast - 30x revenue with over 500M active users.
It reminds me a lot of Uber in its earlier years—fast growth, heavy investment, but edging closer to profitability.
For OpenAI, the more people use the product, the more they spend on compute, unless they can supplement it with other ways of generating revenue.
I unfortunately don't think OpenAI will be able to hit sustained profitability (see Netflix for another example)
Netflix has been profitable for over a decade though? They reported $8.7 billion in profit in 2024.
What? Netflix is incredibly profitable.
Obviously, lots of nerds on HN have preferences for Gemini and Claude, and having used all three I completely get why that is. But we should remember we're not representative of the whole addressable market. There were probably nerds on like ancient dial-up bulletin boards explaining why Betamax was going to win, too.
Again: I don't know. I've got no predictions. I'm just saying that the logic where OpenAI is outcompeted on models themselves and thus automatically lose does not hold automatically.
Similarly, nearly all AI products but especially OpenAI are heavily _under_ monetized. OpenAI is an excellent personal shopper - the ad revenue that could be generated from that rivals Facebook or Google.
You could override its suggestions with paid ones, or nerf the bot's shopping abilities so it doesn't overshadow the sponsors, but that will destroy trust in the product in a very competitive industry.
You could put user-targeted ads on the site not necessarily related to the current query, like ads you would see on Facebook, but if the bot is really such a good personal shopper, people are literally at a ChatGPT prompt when they see the ads and will use it to comparison shop.
(with many potential variants)
I'd say dropping the price of o3 by 80% due to "engineers optimizing inferencing" is a strong sign that they're doing exactly that.
This is marginally less true for embedding models and things you've fine-tuned, but only marginally.
- Carefully interleaving shared memory loading with computation, and the whole kernel with global memory loading.
- Warp shuffling for softmax.
- Avoiding memory access conflicts in matrix multiplication.
I'm sure the guys at ClosedAI have many more optimizations they've implemented ;). They're probably eventually going to design their own chips or use photonic chips for lower energy costs, but there's still a lot of gains to be made in the software.
Optimizing serving isn't unlikely: all of the big AI vendors keep finding new efficiencies, it's been an ongoing trend over the past two years.
They finally implemented DeepSeek open source methods for fast inference?
With the race to get new models out the door, I doubt any of these companies have done much to optimize cost so far. Google is a partial exception – they began developing the TPU ten years ago and the rest of their infrastructure has been optimized over the years to serve computationally expensive products (search, gmail, youtube, etc.).
The more inference customers OpenAI has, the easier it is for them to reach profitability.
* We have people uploading tons of zero-effort slop pieces to all manner of online storefronts, and making people less likely to buy overall because they assume everything is AI now
* We have an uncomfortable community of, to be blunt, actual cultists emerging around ChatGPT, doing all kinds of shit from annoying their friends and family all the way up to divorcing their spouses
* Education is struggling in all kinds of ways due to students using (and abusing) the tech, with already strained administrations struggling to figure out how to navigate it
Like, yeah, if your only metric is OpenAI's particular line going up, it's looking alright. And much like Uber, its success seems to be corrosive to the society in which it operates. Is this supposed to be good news?
A great communicator on the risks of AI being too heavily integrated into society is Zak Stein. As someone who works in education, they see first-hand how people are becoming dependent on this stuff rather than pursuing any kind of self-improvement, just handing over all their thinking to the machine. It is very bizarre, and I have been seeing it a lot more in my personal experience over the last few months.
Plus there's the fact that "thinking models" can't really solve complex tasks / aren't really as good as they are believed to be.
OpenAI is very good at this as well because of their brand name. For many people ChatGPT is all they know. That's the one that's in the news. That's the one everybody keeps talking about. They have many millions of paying users at this point.
This is a non trivial moat. If you can only be successful by not serving most of the market for cost reasons, then you can't be successful. It's how Google has been able to guard its search empire for a quarter century. It's easy to match what they do algorithmically. But then growing from a niche search engine that has maybe a few tens of thousands of users (e.g. Kagi) to Google scale serving essentially most of this planet (minus some fire walled countries like Russia and China), is a bit of a journey.
So Google rolling out search integration is a big deal. It means they are readying themselves for that scale and will have billions of users exposed to this soon.
> Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast
Yes, they are valued based on world+dog needing agentic AIs and subscribing to the extent of tens or hundreds of dollars/month. That's supposed to outstrip the revenue of things like MS Office in its prime.
5B loss is peanuts compared to that. If they weren't burning that, their ambition level would be too low.
Uber now has a substantial portion of the market. They have about 3-4 billion revenue per month. A lot of cost obviously. But they managed 10B profit last year. And they are not done growing yet. They were overvalued at some point and then they crashed, but they are still there and it's a pretty healthy business at this point, and that reflects in their stock price. It's basically valued higher now than at the time of the Softbank investment pre-IPO. Of course a lot of stuff needed to be sorted out for that to happen.
They’re not letting the competition breathe
(It's "contact us" pricing, so I have no idea how much that would set you back. I'm guessing it's not cheap.)
I don't see this happening with for example deepseek.
Is it possible they are saving on resources by having it answer that way?
When I worked at Netflix I sometimes heard the same speculation about intentionally bad recommendations, which people theorized would lower streaming and increase profit margins. It made even less sense there as streaming costs are usually less than a penny. In reality, it’s just hard to make perfect products!
(I work at OpenAI.)
Example, I asked it to write something. And then I asked it to give me that blob of text in markdown format. So everything it needed was already in the conversation. That took a whole minute of doing web searches and what not.
I actually dislike using o3 for this reason. I keep the default to 4o. But sometimes I forget to switch back and it goes off boiling the oceans to answer a simple question. It's a bit too trigger happy with that. In general all this version and model soup is impossible to figure out for non technical users. And I noticed 4o is now sometimes starting to do the same. I guess, too many users never use the model drop down.
takes tinfoil hat off
Oh, nvm, that makes sense.
I am not saying they haven't improved the laziness problem, but it does happen anecdotally. I even got similar sort of "lazy" responses for something I am building with gemini-2.5-flash.
How exactly will passport check prevent any training?
At most this will block API access to your average Ivan, not a state actor
It generally does not. No idea if there are edge cases where it does, but that's definitely not the norm for the average user.
https://community.openai.com/t/session-expired-verify-organi...
https://community.openai.com/t/callback-from-persona-id-chec...
https://community.openai.com/t/verification-issue-on-second-...
https://community.openai.com/t/verification-not-working-and-...
https://community.openai.com/t/organization-verfication-fail...
https://community.openai.com/t/help-organization-could-not-b...
https://community.openai.com/t/to-verify-an-organization-acc...
Yesterday:
- Input: $10.00 / 1M tokens
- Cached input: $2.50 / 1M tokens
- Output: $40.00 / 1M tokens

Today:
- Input: $2.00 / 1M tokens
- Cached input: $0.50 / 1M tokens
- Output: $8.00 / 1M tokens
https://archive.is/20250610154009/https://openai.com/api/pri...

First, I tried enabling o3 via OpenRouter since I have credits with them already. I was met with the following:
"OpenAI requires bringing your own API key to use o3 over the API. Set up here: https://openrouter.ai/settings/integrations"
So I decided I would buy some API credits with my OpenAI account. I ponied up $20 and started Aider with my new API key set and o3 as the model. I get the following after sending a request:
"litellm.NotFoundError: OpenAIException - Your organization must be verified to use the model `o3`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
At that point, the frustration was beginning to creep in. I returned to OpenAI and clicked on "Verify Organization". It turns out, "Verify Organization" actually means "Verify Personal Identity With Third Party" because I was given the following:
"To verify this organization, you’ll need to complete an identity check using our partner Persona."
Sigh. I click "Start ID Check" and it opens a new tab for their "partner" Persona. The initial fine print says:
"By filling the checkbox below, you consent to Persona, OpenAI’s vendor, collecting, using, and utilizing its service providers to process your biometric information to verify your identity, identify fraud, and conduct quality assurance for Persona’s platform in accordance with its Privacy Policy and OpenAI’s privacy policy. Your biometric information will be stored for no more than 1 year."
OK, so now, we've gone from "I guess I'll give OpenAI a few bucks for API access" to "I need to verify my organization" to "There's no way in hell I'm agreeing to provide biometric data to a 3rd party I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder. How do I get my $20 back?"
[1] https://techcrunch.com/2025/04/13/access-to-future-ai-models...
which, after using it, fair! It found a zero day
- generating synthetic data to train their own models
- hacking and exploitation research
etc
Source?
Link: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
[0] https://www.aisi.gov.uk/work/replibench-measuring-autonomous...
It's also the only LLM provider which has this.
What OpenAI has that the others don't is SamA's insatiable thirst for everyone's biometric data.
Agreeing to Persona's terms, especially for biometric identity verification, involves both privacy and long-term data security risks. Here's a clear breakdown of the main risks you should be aware of:

1. Biometric Data Collection
Risk: Biometric identifiers (like facial recognition, voiceprints, etc.) are extremely sensitive and irreplaceable if compromised.
What they collect: Persona may collect a selfie, video, and metadata, and extract biometric templates from those for facial comparison and liveness detection.
If leaked or abused: Unlike passwords, you can't change your face. A future data breach or misuse could lead to permanent identity compromise.
2. Data Storage & Retention
Risk: Persona says biometric data is kept for up to one year, but you're relying on their internal policies, not a legal guarantee.
There’s no technical detail on how securely it’s stored or whether it’s encrypted at rest.
Worst-case scenario: Poorly secured biometric templates could be stolen, reused, or matched against other data sets by bad actors or governments.
3. Third-Party Sharing and Surveillance Risks
Risk: Your biometric and ID data may be shared with subprocessors (partners/vendors) that you haven’t explicitly vetted. Persona may transfer your data to cloud providers (like AWS, GCP), verification specialists, or fraud prevention services.
Depending on jurisdiction, data could be subject to subpoenas, surveillance laws, or government backdoors (especially in the U.S.).
4. Consent Ambiguity & Future Use
Risk: The fine print often includes vague consent for "quality assurance", "model improvement", or "fraud detection". This opens the door to retraining algorithms on your biometric data—even if anonymized, that's still a use of your body as data.
Their privacy policy may evolve, and new uses of your data could be added later unless you opt out (which may not always be possible).
Should You Agree? Only if:
You absolutely need the service that requires this verification.
You’re aware of the privacy tradeoff and are okay with it.
You trust that Persona and its partners won’t misuse your biometric data—even a year down the line.
If you're uneasy about this, you're not alone. Many developers and privacy advocates refuse to verify with biometrics for non-critical services, and companies like OpenAI are increasingly facing criticism for requiring this.

The AG office followed up and I got my refund. Worth my time to file, because we should stop letting companies get away with this stuff where they show up with more requirements after you've paid.
Separately they also do not need my phone number after having my name, address and credit card.
Has anyone got info on why they are taking everyone’s phone number?
>> Wednesday's 9th Circuit decision grew out of revelations that between 2013 and 2019, X mistakenly incorporated users' email addresses and phone numbers into an ad platform that allows companies to use their own marketing lists to target ads on the social platform.
>> In 2022, the Federal Trade Commission fined X $150 million over the privacy gaffe.
>> That same year, Washington resident Glen Morgan brought a class-action complaint against the company. He alleged that the ad-targeting glitch violated a Washington law prohibiting anyone from using “fraudulent, deceptive, or false means” to obtain telephone records of state residents.
>> X urged Dimke to dismiss Morgan's complaint for several reasons. Among other arguments, the company argued merely obtaining a user's phone number from him or her doesn't violate the state pretexting law, which refers to telephone “records.”
>> “If the legislature meant for 'telephone record' to include something as basic as the user’s own number, it surely would have said as much,” X argued in a written motion.
To me the obvious example is fraud/abuse protection.
I’m pretty sure you do. Claude too. The only chatbot company I’ve made an account with is Mistral specifically because a phone number was not a registration requirement.
Also, Netflix wasn't initially selling ads, and now, after drastically increasing the price of their plans in the last few years, the ad-supported subscription is probably their #1 plan, because most people aren't willing to shell out 15 to 25 USD/EUR every month to watch content that is already littered with ads.
So, at the end of your day, company X has an overdetailed profile of you, rather than each advertiser. (And also, at least in the US, can repackage and sell that data into various products if it chooses)
When I signed up I had to do exactly that.
They may still buy data from ad companies and store credit cards, etc.
Many of them link users based on phone number.
ChatGPT has the capacity to modify behavior more subtly than any advertising ever devised. Aggregating knowledge on the person on the other end of the line is key in knowing how to nudge them toward the target behavior. (Note this target behavior may be how to vote in an election, or how to feel about various hot topics.)
It also, as Google learned, enables you to increase your revenue per placement. Advertisers will pay more for placement with their desired audience.
> To me the obvious example is fraud/abuse protection.
Phones are notorious for spam...Seriously. How can the most prolific means of spam be used to prevent fraud and abuse? (Okay, maybe email is a little more prolific?) Like have you never received a spam call or text? Obviously fraudsters and abusers know how to exploit those systems... it can't be more obvious...
What would you do instead?
I would think in a world where we constantly get spam calls and texts that people would understand that a phone number is not a good PKI. I mean we literally don't answer calls from unknown numbers because of this. How is it that we can only look at these things in one direction but not the other?
Phone number is the only way to reliably stop MOST abuse on a freemium product that doesn't require payment/identity verification upfront. You can easily block VOIP numbers and ensure the person connected to this number is paying for an actual phone plan, which cuts down dramatically on bogus accounts.
Hence why even Facebook requires a unique, non-VOIP phone number to create an account these days.
I'm sure this comment will get downvoted in favor of some other conspiratorial "because they're going to secretly sell my data!" tinfoil post (this is HN of course). But my explanation is the actual reason.
I would love if I could just use email to signup for free accounts everywhere still, but it's just too easily gamed at scale.
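For what it's worth, the line-type check being described can be approximated with the open-source phonenumbers library; real services use paid carrier-lookup APIs, and as the replies below argue, the classification is far from reliable:

```python
import phonenumbers
from phonenumbers import PhoneNumberType

def looks_like_voip(raw_number, default_region="US"):
    """Very rough line-type check; the underlying number-range data is incomplete."""
    num = phonenumbers.parse(raw_number, default_region)
    if not phonenumbers.is_valid_number(num):
        return True  # treat invalid numbers as rejectable
    return phonenumbers.number_type(num) == PhoneNumberType.VOIP

# Purely illustrative gate at signup time:
if looks_like_voip("+1 650-555-0100"):
    print("reject: VOIP or invalid number")
else:
    print("accept")
```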
- Parent talks about a paid product. If they want to burn tokens, they are going to pay for it.
- Those phone requirements do not stop professional abusers, organized crime, or state-sponsored groups. Case in point: Twitter is overrun by bots, scammers, and foreign info-ops swarms.
- Phone requirements might hinder non-professional abusers at best, but we are sidestepping the issue of whether these corporations deserve enough trust to compel regular users to sell themselves. Maybe the business model just sucks.
Also, if they don't do freemium they're getting way more valuable information about you than just a phone number.
Only requiring the phone number for API users feels needlessly invasive and is not explained by a vague "countering fraud and abuse" for a paid product...
Your explanation is inconsistent with the link in these comments showing Twitter getting fined for doing the opposite.
> Hence why even Facebook requires a unique, non-VOIP phone number to create an account these days.
Facebook is the company most known for disingenuous tracking schemes. They just got caught with their app running a service on localhost to provide tracking IDs to random shady third party websites.
> You can easily block VOIP numbers and ensure the person connected to this number is paying for an actual phone plan, which cuts down dramatically on bogus accounts.
There isn't any such thing as a "VOIP number", all phone numbers are phone numbers. There are only some profiteers claiming they can tell you that in exchange for money. Between MVNOs, small carriers, forwarding services, number portability, data inaccuracy and foreign users, those databases are practically random number generators with massive false positive rates.
Meanwhile major carriers are more than happy to give phone numbers in their ranges to spammers in bulk, to the point that this is now acting as a profit center for the spammers and allowing them to expand their spamming operations because they can get a large number of phone numbers those services claim aren't "VOIP numbers", use them for spamming the services they want to spam, and then sell cheap or ad-supported SMS service at a profit to other spammers or privacy-conscious people who want to sign up for a service they haven't used that number at yet.
Google tried this with Google Plus and Google Wave, failed spectacularly, and has ironically stopped with this idiotic "marketing by blocking potential users". I can access Gemini Pro 2.5 without providing a blood sample or signing parchment in triplicate.
[1] Not really though, because a significant percentage of OpenAI's revenue is from spammers and bulk generation of SEO-optimised garbage. Those are valued customers!
Difficulty: Impossible
Maybe you’re thinking of deep research mode which is web UI only for now.
Anthropic exposes reasoning, which has become a big reason to use them for reasoning tasks over the other two despite their pricing. Rather ironic when the other two have been pushing reasoning much harder.
[1] https://ai.google.dev/gemini-api/docs/thinking#summaries [2] https://discuss.ai.google.dev/t/massive-regression-detailed-...
Seems familiar…
[1] https://www.forbes.com/advisor/investing/cryptocurrency/what...
> I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder
openai persona verification site:community[.]openai[.]com
e.g. a thread with 36 posts beginning Apr 13:
"OpenAI Non-Announcement: Requiring identity card verification for access to new API models and capabilities"
But always good to be on look out for shenanigans :)
(So I’m remaining locked out of my linkedin account.)
Really a shame OpenAI left their non-profit (and open) roots, could have been something different but nope, the machine ate them whole.
I never heard of OpenRouter prior to this thread, but will now never use them and advocate they never be used either.
Enlightened self-interest is when you realize that you win by being good to your customers, instead of treating customer service like a zero-sum game.
Contact support and ask for a refund. Then a charge back.
https://withpersona.com/legal/privacy-policy
To me it looks like an extremely aggressive data pump.
Which kind of would make the entire “discussion” moot and pointless
Just send them a random passport photo from the Internet, what's the deal? Probably they are just vibe-verifying the photo with "Is it legit passport?" prompt anyways.
this is absurd, how do they define "person"? On the internet I can be another person from another country in a minute, another minute I will be a different person from a different country.
Since Mossad and CIA is essentially one organization they already do it, 100%.
This should be illegal. How many are going to do the same as you, but then think that the effort/time/hassle they would waste to try to get their money back would not be worth it? At which point you've effectively donated money to a corp that implements anti-consumer anti-patterns.
Hello Human Resource, we have all your data, please upload your bio-metric identity, as well as your personal thoughts.
Building the next phase of a corporate totalitarian state, thank you for your cooperation.
I've always wondered how OpenAI could get away with o3's astronomical pricing, though. What does o3 do better than any other model to justify the premium cost?
What? 3 out of 4 companies I consulted for that started using AI for coding marked cost as an important criterion. The 4th one has virtually infinite funding, so they just don't care.
And those aren't average customers.
On Twitter, some people say that some models perform better at night when there is less demand, which allows them to serve a non-quantized model.
Since the models are only available through the API and there is no test to check which version of the model is served, it's hard to know what we're buying...
I wonder if "we quantized it lol" would classify as false advertising for modern LLMs.