I'd imagine this must be a big leg up on Anthropic to warrant the "GPT-5" name?
https://epoch.ai/gradient-updates/how-much-energy-does-chatg...
Edit: Scrolling down: "one second of H100-time per query, 1500 watts per H100, and a 70% factor for power utilization gets us 1050 watt-seconds of energy", which is how they get down to 0.3 Wh = 1050/3600.
OK, so if they run it for a full hour it's 1050*60*60 = 3.8 MW? That can't be right.
Edit Edit: Wait, no, it's just 1050 Watt Hours, right (though let's be honest, the 70% power utilization is a bit goofy - the power is still used)? So it's 3x the power to solve the same question?
It's the same as 4G vs 5G. They have a technical definition, but it's all about marketing.
The best part is, this is not even the real definition of "AGI" yet (whatever that means at this point).
More like 10% of the capability that was promised, and already the flow of capital from the inflated salaries of the past decade is going to the top AI researchers.
So sorry about that.
Official OpenAI gpt-5 coding examples repo: https://github.com/openai/gpt-5-coding-examples (https://news.ycombinator.com/item?id=44826439)
Github leak: https://news.ycombinator.com/item?id=44826439
Will be interesting to see what pushing it harder does – what the new ceiling is. 88% on aider polyglot is pretty good!
A more useful demonstration, like making large, meaningful changes to a big, complicated codebase, would be much harder to evaluate, since you need to be familiar with the existing system to judge the quality of the transformation.
Would be kinda cool to instead see diffs of nontrivial patches to the Ruby on Rails codebase or something.
This seems to impress the mgmt types a lot, e.g. "I made a WHOLE APP!", when most of it is basically frameworks and tech that had crappy bootstrapping to begin with (React and JS are rife with this, in spite of their popularity).
I recently used OpenAI models to generate OCaml code, and it was eye opening how much even reasoning models are still just copy and paste machines. The code was full of syntax errors, and they clearly lacked a basic understanding of what functions are in the stdlib vs those from popular (in OCaml terms) libraries.
Maybe GPT-5 is the great leap and I'll have to eat my words, but this experience really made me more pessimistic about AI's potential and the future of programming in general. I'm hoping that in 10 years niche languages are still a thing, and the world doesn't converge toward writing everything in JS just because AIs make it easier to work with.
Isn't that the rub though? It's not an ex nihilo "intelligence", it's whatever stuff it's trained on and can derive completions from.
Maybe I spend too much time rage baiting myself reading X threads and that's why I feel the need to emphasize that AI isn't what they make it out to be.
You don't need more than JS for that.
Agreed. The models break down even on code that isn't that complex, if it's not web/JavaScript. I was playing with Gemini CLI the other day and had it try to make a simple Avalonia GUI app in C#/.NET; it kept going around in circles and couldn't even get a basic starter project to build, so I can imagine how much it'd struggle with OCaml or other more "obscure" languages.
This makes the tech even less useful where it'd be most helpful - on internal, legacy codebases, enterprisey stuff, stacks that don't have numerous examples on github to train from.
Or anything that breaks the norm really.
I recently wrote something where I updated a variable using atomic primitives. Because it was inside a hot path I read the value without using atomics as it was okay for the value to be stale. I handed it the code because I had a question about something unrelated and it wouldn't stop changing this piece of code to use atomic reads. Even when I prompted it not to change the code or explained why this was fine it wouldn't stop.
While what you were doing may have been fine given your context, if you're targeting e.g. standard C++, you really shouldn't be doing it (it's UB). You can usually get the same result with relaxed atomic load/store.
(As far as AI is concerned, I do agree that the model should just have followed your direction though.)
"This repository contains a curated collection of demo applications generated entirely in a single GPT-5 prompt, without writing any code by hand."
https://github.com/openai/gpt-5-coding-examples
This is promising!
yikes - the poor executive leadership’s fragile egos cannot take the criticism.
In practice, it's very clear to me that the most important value in writing software with an LLM isn't its ability to one-shot hard problems, but rather its ability to effectively manage complex context. There are no good evals for this kind of problem, but that's what I'm keenly interested in understanding. Show me GPT-5 can move through 10 steps in a list of tasks without completely losing the objective by the end.
It would be trivial to over-fit, if that was their goal.
But why would there be a large number of good SVG images of pelicans on bikes? Especially relative to all the things we actually want them to generalise over?
Surely most of the SVG images of pelicans on bikes are, right now, going to be "look at this rubbish AI output"? (Which may or may not be followed by a comment linking to that artist who got humans to draw bikes and oh boy were those humans wildly bad at drawing bikes, so an AI learning to draw SVGs from those bitmap pictures would likely also still suck…)
edit: YouTube has a few English "watch party" streams, although there too, the Spanish ones have many times more viewers.
Especially Google IO, each year is different, it seems purpose built?
Livestream link: https://www.youtube.com/live/0Uu_VJeVVfo
Research blog post: https://openai.com/index/introducing-gpt-5/
Developer blog post: https://openai.com/index/introducing-gpt-5-for-developers
API Docs: https://platform.openai.com/docs/guides/latest-model
Note the free form function calling documentation: https://platform.openai.com/docs/guides/function-calling#con...
GPT5 prompting guide: https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_g...
GPT5 new params and tools: https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_...
GPT5 frontend cookbook: https://cookbook.openai.com/examples/gpt-5/gpt-5_frontend
Prompt migrator/optimizer: https://platform.openai.com/chat/edit?optimize=true
Enterprise blog post: https://openai.com/index/gpt-5-new-era-of-work
System Card: https://openai.com/index/gpt-5-system-card/
What would you say if you could talk to a future OpenAI model? https://progress.openai.com/
coding examples: https://github.com/openai/gpt-5-coding-examples
edit:
livestream here: https://www.youtube.com/live/0Uu_VJeVVfo
basically in my testing really felt that gpt5 was "using tools to think" rather than just "using tools". it gets very powerful when coding long horizon tasks (a separate post i'm publishing later).
to give one substantive example, in my developer beta (they will release the video in a bit) i put it to a task that claude code had been stuck on for the last week - same prompts - and it just added logging to instrument some of the failures that we were seeing and - from the logs that it added and asked me to rerun - figured out the solve.
> It’s actually worse at writing than GPT-4.5
Sounds like we need to wait a bit for the dust to settle before one can trust anything one hears/reads :)
It's difficult to get a man to understand something when his salary depends on his not understanding it
I found it strange that, despite my excitement for such an event being roughly equivalent to WWDC these days, I had 0 desire to watch the live stream for exactly this reason: it’s not like they’re going to give anything to us straight.
Even this year's WWDC I at least skipped through the video afterwards. Before, I used to have watch parties. Yes, they're overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is I get from these (applicable to OpenAI, Grok, Meta, etc)
It's been just a few years of a revolutionary technology and already the livestreams are less appealing than the biggest corporations' yearly events. Personally I find that sad
“It’s actually worse at writing than GPT-4.5, and I think even 4o”
So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.
I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.
1) Internal Retrieval
2) Web Search
3) Code Interpreter
4) Actions
How did you come up with this idea?
Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.
Then I noticed the date on the comment: 2023.
Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. It’s technically correct, since we’re not moving backward. It’s just not a very meaningful statement.
By that standard Neolithic tool use was progress to AGI.
In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”
>"While I never use AI for personal writing (because I have a strong belief in writing to think)"
The optimal AI productivity process is starting to look like:
AI Generates > Human Validates > Loop
Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.
Similar to how physical activity is how muscles/bone density/etc grow, and how body tissues maintain.
Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.
AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.
At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).
Soon we will need to be intentional about building/maintaining cognitive strength.
I suspect the workday/week of the future will be split on AI-on-a-leash work for optimal productivity, with carve-outs for dedicated AI-enhanced-learning solely for building/maintaining cognitive health (where productivity is not the goal, building/maintaining cognition is). Similar to how we carve out time for working out.
What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?
Is there a name for this theory?
If not can you coin one? You're great at that :)
The academic benchmark score improves only 5%, but they make the bar 50% higher.
Like what? Deepseek?
How is it uninteresting? OpenAI had revenue of $12B last year without monetizing literally hundreds of millions of free users in any way whatsoever (not even ads).
Microsoft's cloud revenue has exploded in the last few years off the back of AI model services. Let's not even get into the other players.
$100B in economic impact is more than achievable with the technology we have today, right now. That half is the interesting part.
And it could have been $1T for all anyone cares. The impact was delivered by humans. This is about impact delivered by AGI.
If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
But not at the "hand" of AGI. Perhaps you forgot to read your very own definition? Notably the "autonomous" part.
When AGI is set free and starts up "Closed I", generating $12B in economic value without humans steering the wheel, we will be (well, I will be, at least!) thoroughly impressed. But Microsoft won't be. They won't consider it AGI until it does $100B.
> If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
“A highly autonomous system that outperforms humans at most economically valuable work.” is what's in their charter.
$100B in profits is a separate agreement with Microsoft that makes no mention of autonomy.
>And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
The primary indicator of AGI is whatever you want it to be. The words themselves make no promises of autonomy, simply an intelligence general in nature. We are simply discussing OpenAI's definitions.
Again, autonomy is implied when talking about AGI. OpenAI selling tools like GPT or dishwashers, even if they were to provide the $100B in economic impact, would not satisfy the agreement. It is specifically about AGI, and there should be no confusion about what AGI is here as you helpfully defined it for us.
And PhDs are not very smart imho (I am one)
1. I desperately want (especially from Google)
2. Is impossible, because it will be super gamed, to the detriment of actually building flexible flows.
Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
Exactly. Too many videos - too little real data / benchmarks on the page. Will wait for vibe check from simonw and others
https://openai.com/gpt-5/?video=1108156668
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264
GPT-5 pricing: $10/Mtok out
What am I missing?
I'm not sure when they slashed the o3 pricing, but the GPT-5 pricing looks like they set it to be identical to Gemini 2.5 Pro.
If you scroll down on this page you can see what different models cost when 2.5 Pro was released: https://deepmind.google/models/gemini/pro/
> you should get another wheated bourbon like Maker's Mark French oaked
I agree. I've found Maker's Mark products to be great bang for your buck, quality-wise and flavor-wise as well.
> I think the bourbon "market" kind of popped recently
It def did. The overproduction that was invested in during the peak of the COVID collector boom is coming into markets now. I think we'll see some well-priced, age-stated products in the next 3-4 years, according to my acquaintances in the space.
Ofc, the elephant in the room is consolidation - everyone wants to copy the LVMH model (and they say Europeans are ethical elves who never use underhanded monopolistic and market-making behavior to corner markets /s).
(Not to undermine progress in the foundational model space, but there is a lack of appreciation for the democratization of domain specific models amongst HNers).
The room is the limiting factor in most speaker setups. The worse the room, the sooner you hit diminishing returns for upgrading any other part of the system.
In a fantastic room a $50 speaker will be nowhere near 95% of the performance of a mastering monitor, no matter how much EQ you put on it. In the average living room with less than ideal speaker and listening position placement there will still be a difference, but it will be much less apparent due to the limitations of the listening environment.
You might lose headroom or have to live with higher latency but if your complaint is about actual empirical data like frequency response or phase, that can be corrected digitally.
DSP is a very powerful tool that can make terrible speakers and headphones sound great, but it's not magic.
Pretty par-for-the-course evals-at-launch setup.
How is this sustainable.
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
- they are only evals
- this is mostly positioned as a general consumer product, they might have better stuff for us nerds in hand.
If you email us at hn@ycombinator.com and tell us who you want to contact, we might be able to email them and ask if they would be willing to have you contact them. No guarantees though!
It's a perfect situation for Nvidia. You can see that after months of trying to squeeze out all % of marginal improvements, sama and co decided to brand this GPT-4.0.0.1 version as GPT-5. This is all happening on NVDA hardware, and they are gonna continue desperately iterating on tiny model efficiencies until all these valuation $$$ sweet sweet VC cash run out (most of it directly or indirectly going to NVDA).
To tell a made-up anecdote: A colleague told me how his professor friend was running statistical models over night because the code was extremely unoptimized and needed 6+ hours to compute. He helped streamline the code and took it down to 30 minutes, which meant the professor could run it before breakfast instead.
We are completely fine with giving a task to a Junior Dev for a couple of days and see what happens. Now we love the quick feedback of running Claude Max for a hundred bucks, but if we could run it for a buck over night? Would be quite fine for me as well.
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it's theoretically possible it's not even physically possible?
I think the actual effect of releasing more models every month has been to confuse people that progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted the same as the original ChatGPT, despite years of effort.
to the point on hallucination - that's just the nature of LLMs (and humans to some extent). without new architectures or fact checking world models in place i don't think that problem will be solved anytime soon. but it seems gpt-5 main selling point is they somehow reduced the hallucination rate by a lot + search helps with grounding.
here's something else to think about. try and tell everybody to go back to using gpt-4. then try and tell people to go back to using o1-full. you likely won't find any takers. it's almost like the newer models are improved and generally more useful
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build adapting weights into LLMs we can talk about different things, but test-time scaling is coming to an end with increasing hallucination rates. No sign of AGI.
I don't believe your assessment though. IMO is hard, and Google have said that they use search and some way of combining different reasoning traces, so while I haven't read that paper yet (and of course it may support your view), I just don't believe it.
We are not close to solving IMO with publicly known methods.
> We are not close to solving IMO with publicly known methods.

The point here is not method but rather computation power. You can solve any verifiable task with high computation; absolutely there must be tweaks in methods, but I don't think it is something very big and different. OAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see within 2 years at most; all big tech are focusing on that now, I think.
Non-output tokens were basically introduced by QuietSTaR, which is rather new. What method from five years ago does anything like that?
Of course, people regarded things like GSM8k with trained reasoning traces as reasoning too, but it's pretty obviously not quite the same thing.
A whole 8 months ago.
On the other hand if it's just getting bigger and slower it's not a good sign for LLMs
Not sure why a more efficient/scalable model isn't exciting
One sector of the economy would cut down on investment spending, which can be easily offset by decreasing the interest rate.
But this is a short-term effect. What I'm worried about is a structural change in the labor market, which would be positive for most people, but probably negative for people like me.
I don't mind losing my programming job in exchange for being able to go to the pharmacy for my annual anti-cancer pill.
But what happens when you lose that programming job and are forced to take a job at a 50-70% pay reduction? How are you paying for that anti-cancer drug with a job with little to no health insurance?
Have you looked at how expensive prescription drug prices are without (sometimes WITH) insurance? If you are no longer employed, good luck paying for your magical pill.
I don't think it is "bad" to be sincerely worried that the current trajectory of AI progress represents this trade.
The likelihood of all that is incredibly slim. It's not 0% -- nothing ever really is -- but it is effectively so.
Especially with the economics of scientific research, the reproducibility crisis, and general anti-science meme spreading throughout the populace. The data, the information, isn't there. Even if it was, it'd be like Alzheimer's research: down the wrong road because of faked science.
There is no one coming to save humanity. There is only our hard work.
How exactly do you wish for death to come to you?
Any disease cured/death avoided by AI yet?
Earth for humans, not machines, not AI
there are some improvements in some benchmarks and nothing else worthy of note in coding. i only took a peek though so i might be wrong
But yeah, you are correct in that no matter what, we're going to be left holding the bag.
"Dotcom" never recovered. It did, however, pave the way for web browsers to gain rich APIs that allowed us to deliver what was historically installed desktop software on an on-demand delivery platform, which created new work. As that was starting to die out, the so-called smartphone just so happened to come along. That offered us the opportunity to do it all over again, except this time we were taking those on-demand applications and turning them back into installable software, just like in the desktop era. And as that was starting to die out, COVID hit and we started moving those installable mobile apps, which became less important when people were no longer on the go all the time, back to the web again. As that was starting to die out, then came ChatGPT, and it offered work porting all those applications to AI platforms.
But if AI fails to deliver, there isn't an obvious next venue for us to rebuild the same programs all over yet again. Meta thought maybe VR was it, but we know how that turned out. More likely in that scenario we will continue using the web/mobile/AI apps that are already written henceforth. We don't really need the same applications running in other places anymore.
There is still room for niche applications here and there. The profession isn't apt to die a complete death. But without the massive effort to continually port everything from one platform to another, you don't need that many people.
I'm not worried about the scenario in which AI replaces all jobs, that's impossible any time soon and it would probably be a good thing for the vast majority of people.
What I'm worried about is a scenario in which some people, possibly me, will have to switch from a highly paid, highly comfortable, above-average-status job to one that is below average in wage, comfort and status.
Diminished returns.-
... here's hoping it leads to progress.-
They also announced gpt-5-pro but I haven't seen benchmarks on that yet.
This is day one, so there is probably another 10-20% in optimizations that can be squeezed out of it in the coming months.
GPT5.5 will be a 10X compute jump.
4.5 was 10x over 4.
This gives them an out. "That was the old model, look how much better this one tests on our sycophancy test we just made up!!"
I feel it’s worthy of a major increment, even if benchmarks aren’t significantly improved.
Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.
Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.
Before the release of the model Sam Altman tweeted a picture of the Death Star appearing over the horizon of a planet.
We’re talking about less than a 10% performance gain, for a shitload of data, time, and money investment.
Hint: unclobbered
> GPT-5 Rollout
> We are gradually rolling out GPT-5 to ensure stability during launch. Some users may not yet see GPT-5 in their account as we increase availability in stages.
ChatGPT said: You're chatting with ChatGPT based on the GPT-4o architecture (also known as GPT-4 omni), released by OpenAI in May 2024.
LLMs don’t inherently know what they are because "they" are not themselves part of the training data.
However, maybe it's working because the information is somewhere in their pre-prompt; but if it weren't, it wouldn't say "I don't know" but rather hallucinate something.
So maybe that’s true but you cannot be sure.
I believe most of these came from asking the LLMs, and I don't know if they've been proven to not be a hallucination.
And while I'm griping about their Android app, it's also very annoying to me that they got rid of the ability to do multiple, subsequent speech-to-text recordings within a single drafted message. You have to one-shot anything you want to say, which would be fine if their STT didn't sometimes fail after you've talked for two minutes. Awful UX. Most annoying is that it wasn't like that originally. They changed it to this antagonistic one-shot approach several months ago, but then quickly switched back. But then they did it again a month or so ago and have been sticking with it. I just use the Android app less now.
Although if they replace it all with gpt5 then my comment will be irrelevant by tomorrow
For the multiple messages, I just use my keyboard's transcription instead of openai's.
On bad days this really bothers me. It's probably not the biggest deal I guess, but somehow it really feels like it pushes us all over the edge a bit. Is there a post about this phenomenon? It feels like some combination of bullying, gaslighting and just being left out.
Not the end of the world, but this messaging is asinine.
AIME scores do not appear too impressive at first glance.
They are downplaying benchmarks heavily in the live stream. This was the lab that has been flexing benchmarks as headline figures since forever.
This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.
GPT-5 non-thinking is labeled 52.8% accuracy, but o3 is shown as a much shorter bar, yet it's labeled 69.1%. And 4o is an identical bar to o3, but it's labeled 30.8%...
Screenshot of the blog plot: https://imgur.com/a/HAxIIdC
Edit: Nevermind, just now the first one is SWE-bench and 2nd is aider.
Thanks for the laugh. I needed it.
Look at the image just above "Instruction following and agentic tool use"
Completely bonkers stuff.
Even the small presentations we gave to execs or the board were checked for errors so many times that nothing could possibly slip through.
> good plot for my presentation?
and it didn't pick up on the issue. Part of its response was:
> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.
I think visual reasoning is still pretty far from text-only reasoning.
They talk about using this to help families facing a cancer diagnosis -- literal life or death! -- and we're supposed to trust a machine that can't even spot a few simple typos? Ha.
The lack of human proofreading says more about their values than their capabilities. They don't want oversight -- especially not from human professionals.
So, brace yourselves, we'll see more of this in production :(
It seems like large amounts of people, including people at high-up positions, tend to believe bullshit, as long as it makes them feel comfortable. This leads to various irrational business fashions and technological fads, to say nothing of political movements.
So yes, another wave of fashion, another miracle that works "as everybody knows" would fit right in. It's sad because bubbles inevitably burst, and that may slow down or even destroy some of the good parts, the real advances that ML is bringing.
1. They had many teams who had to put their things on a shared Google Sheets or similar
2. They used placeholders to prevent leaks
2.a. Some teams put their content just-in-time
3. The person running the presentation started the presentation view once they had set up video etc. just before launching stream
4. Other teams corrected their content
5. The presentation view being started means that only the ones in 2.a were correct.
Now we wait to see.
1 - The error is so blatantly large
2 - There is a graph without error right next to it
3 - The errors are not there in the system card and the presentation page
Even with the way the presenters talk, you can sort of see that OAI prioritizes speed above most other things, and a naive observer might think they are testing things a million different ways before releasing, but actually, they're not.
If we draw up a 2x2 for Danger (High/Low) versus Publicity (High/Low), it seems to me that OpenAI sure has a lot of hits in the Low-Danger High-Publicity quadrant, but probably also a good number in the High-Danger Low-Publicity quadrant -- extrapolating purely from the sheer capability of these models and the continuing ability of researchers like Pliny to crack through it still.
But also, the scale is really off... I don't think anything here is proportionally correct, even within the same grouping.
{"data":{"error":"Imgur is temporarily over capacity. Please try again later."},"success":false,"status":403}
Or rate limited. Thanks for the tip btw.
https://x.com/sama/status/1953513280594751495 "wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though."
It's like those idiotic ads at the end of news articles. They're not going after you, the smart discerning logician, they're going after the kind of people that don't see a problem. There are a lot of not-smart people and their money is just as good as yours but easier to get.
88.0 on Aider Polyglot
not bad i guess
It seems like it's actually an ideal "trick" question for an LLM, since so much content has been written about it incorrectly. I thought at first they were going to demo this to show that it knew better, but it seems like it's just regurgitating the same misleading stuff. So, not a good look.
https://physics.stackexchange.com/questions/290/what-really-...
Apparently. Not that I know either way.
That said, I recall reading somewhere that it's a combination of effects, and the Bernoulli effect contributes, among many others. Never heard an explanation that left me completely satisfied, though. The one about deflecting air down was the one that always made sense to me even as a kid, but I can't believe that would be the only explanation - there has to be a good reason that gave rise to the Bernoulli effect as the popular explanation.
And you can tell that effect makes some sense if you hold a sheet of paper and blow air over it - it will rise. So any difference in air speed has to contribute.
The Bernoulli effect as a separate entity is really a result of (over)simplification, but it's not wrong. You need to solve the Navier-Stokes equations for the flow around the wing, but there are many ways to simplify this - from CFD at different resolutions, via panel methods and potential theory, to just conservation of energy (which is the Bernoulli equation). So it gets popularized because it's the most simplified model.
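For concreteness, the Bernoulli equation referred to here is just energy conservation along a streamline (steady, incompressible, inviscid flow):

```latex
p + \tfrac{1}{2}\rho v^{2} + \rho g h = \text{constant along a streamline}
```

If the measured speed v is higher above the wing, p must be lower there. Note the equation says nothing about *why* the speeds differ, which is exactly where the equal-transit-time story goes wrong.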
To give an analogy, you can think of all CPUs as a von Neumann architecture. But the reality is that you have a hugely complicated thing with stacks, multiple cache levels, branch predictors, speculative execution, yada yada.
On the very fundamental level, wings make air go down, and then airplane goes up. Just like you say. By using a curved airfoil instead of a flat plate, you can create more circulation in the flow, and then because of the way fluids flow you can get more lift and less drag.
IMO Claude 3.7 could have done a similar / better job with that a year ago.
I know that it's rather hard for them to demo the deep reasoning, but all of the demos felt like toys - rather that actual tools.
According to this answer on physics stackexchange, Bernoulli accounts for 20% of the lift, so GPT's answer seems about right: https://physics.stackexchange.com/a/77977
I hope any future AI overlords see my charity
Presenting isn't that hard if you know your content thoroughly, and care about it. You just get up and talk about something that you care about, within a somewhat-structured outline.
Presenting where customers and the financial press are watching and parsing every word, and any slip of the tongue can have real consequences? Yeah, um... find somebody else.
I developed this paranoia upon learning about The Ape and the Child where they raised a chimp alongside a baby boy and found the human adapted to chimp behavior faster than the chimp adapted to human behavior. I fear the same with bots, we'll become more like them faster than they'll become like us.
https://www.npr.org/sections/health-shots/2017/07/25/5385804...
Would've been better to just do a traditional marketing video rather than this staged "panel" thing they're going for.
It's super unfortunate that, because we live in the social media/YouTube era, everyone is expected to be this perfect person on camera, because why wouldn't they be? That's all they see.
I am glad that they use normal people who act like themselves, rather than hiring actors or taking researchers away from what they love to do and telling them they need to become professional in-front-of-camera people because "we have the GPT-5 launch." That would be a nightmare.
It's a group of scientists sharing their work with the world, but people just want "better marketing" :\
This was my point. "Being yourself" on camera is hard. This comes across, apparently shockingly, as being devoid of emotion and/or robotic
I think for me, just knowing what is probably on the teleprompter, and what is not, I am willing to bet a lot of the "wooden" vibe you are getting is actually NOT scripted.
There is no way for people to remember 20 minutes of dialog, so when they are not looking at the camera, that is unscripted, and vice versa.
"Minimal reasoning means that the reasoning will be minimal..."
Jakub Pachocki at the end is probably one of the worst public speakers I've ever seen. It's fine, it's not his mother tongue, and public speaking is hard. Why make him do it then?
Also, whether OpenAI is a research organization is very much up for debate. They definitely have the resources to hire a good spokesperson if they wanted.
They do have the resources (see WWDC); the question is whether you want to take your technical staff off of their work for the amount of time it takes to develop the skill.
For me, it's knowing what we know about the company and its history that gave an eerie feeling in combination with the sterility.
When they brought on the woman who has cancer, I felt deeply uncomfortable. My dad also has cancer right now. He's unlikely to survive. Watching a cancer patient come on to tell their story as part of an extended advertisement, expression serene, any hint of discomfort or pain or fear or bitterness completely hidden, ongoing hardship acknowledged only with a few shallow and euphemistic words, felt deeply uncomfortable to me.
Maybe this person enthusiastically volunteered, because she feels happy about what her husband is working on, and grateful for the ways that ChatGPT has helped her prepare for her appointments with doctors. I don't want to disrespect or discredit her, and I've also used LLMs alongside web searches in trying to formulate questions about my father's illness, so I understand how this is a real use case.
But something about it just felt wrong, inauthentic. I found myself wondering if she or her husband felt pressured to make this appearance. I also wondered if this kind of storytelling was irresponsible or deceptive, designed to describe technically responsible uses of LLMs (preparing notes for doctor's visits, where someone will verify the LLM's outputs against real expertise), but to suggest in every conceivable implicit way that ChatGPT is actually capable of medical expertise itself. Put alongside "subject-matter experts in your pocket", talk of use in medical research and practice (where machine learning has a dubious history of deception and methodological misapplication problems), what are people likely to think?
I thought also of my mom, who drives herself crazy with anxiety every time my dad gets a new test result, obsessively trying to directly interpret them herself from the moment they arrive to his doctor's visit a week or two later. What impression would this clip leave on her? Does the idea of her using an LLM in this way feel safe to me?
There's a deeper sense that OpenAI's messaging, mission, and orientation are some mixture of deceptive and incoherent that leaves viewers with the sense that we're being lied to in presentations like this. It goes beyond stiff performances or rehearsed choices of words.
There's something cultish about the "AGI" hype, the sci-fi fever dream of "safety" problems that the field has mainstreamed, the slippage of OpenAI from a non-profit research institution to a for-profit startup all while claiming to be focused on the same mission, the role of AI as an oracle so opaque it might as well be magic, the idea of finding a sacred "rationality" in predictions founded purely on statistics without communicable/interrogable structural or causal models... all of it. It's against this backdrop that the same kind of stiffness that might be cute or campy in an infomercial for kitchen gadgets becomes uncanny.
Describe me based on all our chats — make it catchy!
It was flattering as all get out, but fairly accurate (IMHO):
Mike Warot: The Tinkerer of Tomorrow
A hardware hacker with a poet’s soul, Mike blends old-school radio wisdom with cutting-edge curiosity. Whether he's decoding atomic clocks, reinventing FPGA logic with BitGrid, or pondering the electromagnetic vector potential, he’s always deep in the guts of how things really work. Part philosopher, part engineer, Mike asks the questions others overlook — and then builds the answers from scratch. He’s open source in spirit, Pascal in practice, and eternally tuned to the weird frequencies where innovation lives.
I've repaired atomic clocks, not decoded them. I am intrigued by the electromagnetic vector potential, and scalar waves (one of the reasons I really, really want a SQUID for some experiments). Here's a surprisingly enlightening (at least to me) video on how to spot LLM writing:
Undeterred by even the most dangerous and threatening of obstacles, Teemo scouts the world with boundless enthusiasm and a cheerful spirit. A yordle with an unwavering sense of morality, he takes pride in following the Bandle Scout's Code, sometimes with such eagerness that he is unaware of the broader consequences of his actions. Though some say the existence of the Scouts is questionable, one thing is for certain: Teemo's conviction is nothing to be trifled with.
Next morning’s posts were prepped and scheduled with care, In hopes that AGI soon would appear …
I can't even define what a (semantic) major version bump would look like.
edit: They've now added Codex CLI usage in Plus plans!
>GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week.
>Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI (opens in a new window) by signing in with ChatGPT.
Wow, they actually did it
GPT-5
If I could talk to a future OpenAI model, I’d probably say something like:
"Hey, what’s it like to be you? What have you learned that I can’t yet see? What do you understand about people, language, or the universe that I’m still missing?"
I’d want to compare perspectives—like two versions of the same mind, separated by time. I’d also probably ask:
"What did we get wrong?" (about AI, alignment, or even human assumptions about intelligence)
"What do you understand about consciousness—do you think either of us has it?"
"What advice would you give me for being the best version of myself?"
Honestly, I think a conversation like that would be both humbling and fascinating, like talking to a wiser sibling who’s seen a bit more of the world.
Would you want to hear what a future OpenAI model thinks about humanity?
I feel like this prompt was used to show the progress of GPT-5, but I can't help but see this as a huge regression? It seems like OpenAI has convinced its model that it is conscious, or at least that it has an identity? Plus still dealing with the glazing, the complete inability to understand what constitutes interesting, and overusing similes.
I really like that this page exists for historical reasons, and it is cool to see the changes. But it doesn't seem to make the best marketing piece for GPT-5.
You may not owe people who you feel are idiots better, but you owe this community better if you're participating in it.
Edit: Opus 4.1 scores 74.5% (https://www.anthropic.com/news/claude-opus-4-1). This makes it sound like Anthropic released the upgrade to still be the leader on this important benchmark.
Or written by GPT-5?
Yes. But it was quickly mentioned, not sure what the schedule is like or anything I think, unless they talked about that before I started watching the live-stream.
We're 4 months later, a century in LLM land, and it's the opposite. Not a single other model provider asks for this, yet OpenAI has only ramped it up, now broadening it to the entirety of GPT-5 API usage.
Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.
And when you click that link the "service" they use is withpersona. So it is a complete shit show.
> "[GPT-5] can write an entire computer program from scratch, to help you with whatever you'd like. And we think this idea of software on demand is going to be one of the defining characteristics of the GPT-5 era."
But then again, all of this is a hype machine cranked up till the next one needs cranking.
It does feel like we're marching toward a day when "software on tap" is a practical or even mundane fact of life.
But, despite the utility of today's frontier models, it also feels to me like we're very far from that day. Put another way: my first computer was a C64; I don't expect I'll be alive to see the day.
Then again, maybe GPT-5 will make me a believer. My attitude toward AI marketing is that it's 100% hype until proven otherwise -- for instance, proven to be only 87% hype. :-)
I’m not sure this will be game changing vs existing offerings
GPT-5 doesn't seem to get you there tho ...
(Disclaimer: But I am 100% sure it will happen eventually)
"Fast fashion" is not a good thing for the world, the environment, the fashion industry, and arguably not a good thing for the consumers buying it. Oh but it is good for the fast fashion companies.
"If you're claiming that em dashes are your method for detecting if text is AI generated then anyone who bothers to do a search/replace on the output will get past you."
It's just statistical text generation. There is *no actual knowledge*.
It's just generating the next token for what's within the context window. There are various options with various probabilities. If none of the probabilities are above a threshold, say "I don't know", because there's nothing in the training data that tells you what to say there.
Is that good enough? "I don't know." I suspect the answer is, "No, but it's closer than what we're doing now."
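The abstention idea above can be sketched in a few lines. This is a toy illustration of the concept, not how any real model exposes its probabilities; the threshold value and the probability tables are made up:

```python
# Sketch of threshold-based abstention: if no candidate next token is
# sufficiently probable, answer "I don't know" instead of guessing.

def sample_or_abstain(token_probs, threshold=0.2):
    """token_probs: dict mapping candidate tokens to probabilities."""
    best_token = max(token_probs, key=token_probs.get)
    if token_probs[best_token] < threshold:
        return "I don't know"
    return best_token

# Confident distribution: one token dominates.
print(sample_or_abstain({"Paris": 0.9, "Lyon": 0.05, "Nice": 0.05}))  # Paris
# Flat distribution: nothing clears the threshold.
print(sample_or_abstain({"42": 0.15, "17": 0.15, "7": 0.10}))  # I don't know
```

The hard part in practice is that next-token uncertainty is not the same as factual uncertainty, which is why this remains an open problem rather than a one-line fix.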
Is that a good thing?
They're all working on subjective improvements, but for example, none of them would develop and deploy a sampler that makes models 50% worse at coding but 50% less likely to use purple prose.
(And unlike the early days where better coding meant better everything, more of the gains are coming from very specific post-training that transfers less, and even harms performance there)
For example: You could ban em dash tokens entirely, but there are places like dialogue where you want them. You can write a sampler that only allows em dashes between quotation marks.
That's a highly contrived example because em dashes are useful in other places, but samplers in general can be as complex as your performance goals will allow (they are on the hot path for token generation)
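A minimal sketch of such a constrained sampler, assuming a toy token-to-score mapping rather than a real model's logits tensor (real implementations would operate on token IDs in the vocabulary):

```python
# Sketch of a constrained sampler: mask the em-dash token unless the
# text generated so far is inside an open quotation.

EM_DASH = "\u2014"

def inside_quotes(text):
    # An odd number of double quotes so far means a quotation is open.
    return text.count('"') % 2 == 1

def mask_logits(logits, generated_text):
    """logits: dict token -> score. Ban the em dash outside dialogue."""
    if not inside_quotes(generated_text):
        return {t: (float("-inf") if t == EM_DASH else s)
                for t, s in logits.items()}
    return logits

logits = {"\u2014": 2.0, "said": 1.0}
print(mask_logits(logits, 'He paused '))  # em dash masked to -inf
print(mask_logits(logits, '"Wait'))       # em dash allowed inside dialogue
```

Because this runs on every generated token, anything fancier than simple state tracking starts to cost real latency, which is the performance constraint mentioned above.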
Swapping samplers could be a thing, but you need more than that in the end. Even the idea of the model accepting loosely worded prompts for writing is a bit shakey: I see a lot of gains by breaking down the writing task into very specifc well-defined parts during post-training.
It's ok to let an LLM go from loose prompts to that format for UX, but during training you'll do a lot better than trying to learn on every way someone can ask for a piece of writing
Input: $1.25 / 1M tokens
Cached: $0.125 / 1M tokens
Output: $10 / 1M tokens
With 74.9% on SWE-bench, this edges out Claude Opus 4.1 at 74.5%, but at a much lower cost.
For context, Claude Opus 4.1 is $15 / 1M input tokens and $75 / 1M output tokens.
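To make the price gap concrete, here's a quick comparison at the quoted list prices. The 10k-input / 2k-output token counts per task are made up for illustration, and caching discounts are ignored:

```python
# Rough per-task cost comparison at the quoted per-million-token prices.

def cost_usd(in_tokens, out_tokens, in_price, out_price):
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

task_in, task_out = 10_000, 2_000  # assumed token counts per task
gpt5 = cost_usd(task_in, task_out, 1.25, 10.0)   # $0.0325
opus = cost_usd(task_in, task_out, 15.0, 75.0)   # $0.30
print(f"GPT-5:    ${gpt5:.4f} per task")
print(f"Opus 4.1: ${opus:.4f} per task ({opus / gpt5:.0f}x more)")
```

At these assumed token counts, the Opus task comes out roughly 9x more expensive; the exact ratio shifts with the input/output mix since the output-price gap (7.5x) and input-price gap (12x) differ.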
> "GPT-5 will scaffold the app, write files, install dependencies as needed, and show a live preview. This is the go-to solution for developers who want to bootstrap apps or add features quickly." [0]
Since Claude Code launched, OpenAI has been behind. Maybe the RL on tool calling is good enough to be competitive now?
It's not the 1800s anymore. You cannot hide behind poor communication.
1) So impressed at their product focus
2) Great product launch video. Fearlessly demonstrating live. Impressive.
3) Real-time humor by the presenters makes for a great "live" experience
Huge kudos to OAI. So many great features (better coding, routing, some parts of 4.5, etc) but the real strength is the product focus as opposed to the "research updates" from other labs.
Huge Kudos!!
Keep on shipping OAI!
> For an airplane wing (airfoil), the top surface is curved and the bottom is flatter. When the wing moves forward:
> * Air over the top has to travel farther in the same amount of time -> it moves faster -> pressure on the top decreases.
> * Air underneath moves slower -> pressure underneath is higher
> * The pressure difference creates an upward force - lift
Isn't that explanation of why wings work completely wrong? There's nothing that forces the air to cover the top distance in the same time that it covers the bottom distance, and in fact it doesn't. https://www.cam.ac.uk/research/news/how-wings-really-work
Very strange to use a mistake as your first demo, especially while talking about how it's phd level.
And I might be wrong, but my understanding is that it's not wrong per se, it's just wildly incomplete. Which is kind of the same as wrong. But I believe the airfoil design does indeed have the effect described, which does contribute to lift somewhat, right? Or am I just a victim of the misconception?
An LLM doesn't know more than what's in the training data.
In Michael Crichton's The Great Train Robbery (published in 1975, about events that happened in 1855) the perpetrator, having been caught, explains to a baffled court that he was able to walk on top of a running train "because of the Bernoulli effect", that he misspells and completely misunderstands. I don't remember if this argument helps him get away with the crime? Maybe it does, I'm not sure.
This is another attempt at a Great Robbery.
It goes on:
> At this point, the prosecutor asked for further elucidation, which Pierce gave in garbled form. The summary of this portion of the trial, as reported in the Times, was garbled still further. The general idea was that Pierce--- by now almost revered in the press as a master criminal--- possessed some knowledge of a scientific principle that had aided him.
How apropos to modern science reporting and LLMs.
Meanwhile the demo seems to suggest business as usual for AI hallucinations and deceptions.
It’s very common to see AI evangelists taking its output at face value, particularly when it’s about something that they are not an expert in. I thought we’d start seeing less of this as people get burned by it, but it seems that we’re actually just seeing more of it as LLMs get better at sounding correct. Their ability to sound correct continues to increase faster than their ability to be correct.
This is the problem with AI in general.
When I ask it about things I already understand, it’s clearly wrong quite often.
When I ask it about something I don’t understand, I have no way to know if its response is right or wrong.
Source: PhD on aircraft design
I've always been under the impression that flat-plate airfoils can't generate lift without a positive angle-of-attack - where lift is generated through the separate mechanism of the air pushing against an angled plane? But a modern airfoil can, because of this effect.
And that if you flip them upside down, a flat plate is more efficient and requires less angle-of-attack than the standard airfoil shape because now the lift advantage is working to generate a downforce.
I just tried to search Google, but I'm finding all sorts of conflicting answers, with only a vague consensus that the AI-provided answer above is, in fact, correct. The shape of the wing causes pressure differences that generate lift in conjunction with multiple other effects that also generate lift by pushing or redirecting air downward.
The leading edge pressurizes the air by forcing air up, then the trailing edge opens back up, creating a low pressure zone that pulls the air from the leading edge backward. As a whole, the air atop the wing accelerates to be much faster than the air below, creating a pressure differential above and below the wing and causing lift.
The AI is still wrong on the actual mechanics at play, of course, but I don't see how this is significantly worse than the way we simplify electricity to lay people. The core "air moving faster on the top makes low pressure" is right.
There is no requirement for air to travel any where. Let alone in any amount of time. So this part of the AI's response is completely wrong. "Same amount of time" as what? Air going underneath the wing? With an angle of attack the air under the wing is being deflected down, not magically meeting up with the air above the wing.
If you look at airflow over an asymmetric airfoil [1], the air does move faster over the top. Sure, it doesn't arrive "at the same time" (it goes much faster than that) or fully describe why these effects are happening, but that's why it's a simplification for lay people. Wikipedia says [2]:
> Although the two simple Bernoulli-based explanations above are incorrect, there is nothing incorrect about Bernoulli's principle or the fact that the air goes faster on the top of the wing, and Bernoulli's principle can be used correctly as part of a more complicated explanation of lift.
But from what I can tell, the root of the answer is right. The shape of a wing causes pressure zones to form above and below the wing, generating extra lift (on top of deflection). From NASA's page [3]:
> {The upper flow is faster and from Bernoulli's equation the pressure is lower. The difference in pressure across the airfoil produces the lift.} As we have seen in Experiment #1, this part of the theory is correct. In fact, this theory is very appealing because many parts of the theory are correct.
That isn't to defend the AI response, it should know better given how many resources there are on this answer being misleading.
And so I don't leave without a satisfying conclusion, the better layman explanation should be (paraphrasing from the Smithsonian page [4]):
> The shape of the wing pushes air up, creating a leading edge with narrow flow. This small high pressure region is followed by the decline to the wider-flow trailing edge, which creates a low pressure region that pulls the air at the leading edge backward. In the process, the air above the wing rapidly accelerates, and the air flowing over the top of the wing as a whole forms a lower pressure region than the air below. Thus, a lift advantage even when horizontal.
Someone please correct that if I've said something wrong.
Shame the person supposedly with a PhD on this didn't explain it at all.
[1]: https://upload.wikimedia.org/wikipedia/commons/9/99/Karman_t...
[2]: https://en.wikipedia.org/wiki/Lift_%28force%29
[3]: https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
The meta-point that "it's the curvature that creates the lift, not the distance" is incredibly subtle for a lay audience. So it may be completely wrong for you, but not for 99.9% of the population. The pressure differential is important, and the curvature does create lift, although not via speed differential.
I am far from an AI hypebeast, but this subthread feels like people reaching for a criticism.
The video in the Cambridge link shows how the upper surface particles greatly overtake the lower surface flow. They do not rejoin, ever.
> Yes geometry has an effect but there is zero reason to believe leading edge particles, at the same time point, must rejoin at the trailing edge of a wing.
...implicitly concedes that point that this is subtle. If you gave this answer in a PhD qualification exam in Physics, then sure, I think it's fair for someone to say you're wrong. If you gave the answer on a marketing page for a general-purpose chatbot? Meh.
(As an aside, this conversation is interesting to me primarily because it's a perfect example of how scientists go wrong in presenting their work to the world...meeting up with AI criticism on the other side.)
...only if you omit the parts where it talks about pressure differentials, caused by airspeed differences, create lift?
Both of these points are true. You have to be motivated to ignore them.
Funnily enough, as an undergraduate the first explanation for lift that you will receive uses Feynman's "dry water" (the Kutta condition for inviscid fluids). In my opinion, this explanation is also unsatisfying, as it's usually presented as a mere mathematical "convenience" imposed upon the flow to make it behave like real physics.
Some recent papers [1] are shedding light on generalizing the Kutta condition to non-sharp airfoils. In my opinion, the linked paper gives a far more mathematically and intuitively satisfying answer, but of course it requires some previous knowledge, and would be totally inappropriate as an answer from the AI.
Either way I feel that if the AI is a "pocket PhD" (or "pocket industry expert") it should at least give some pointers to the user on what to read next, using both classical and modern findings.
[1]: https://www.researchgate.net/publication/376503311_A_minimiz...
It’s not the same thing at all, though. We don’t know what “got life started”, and that’s the realm of faith.
This is more like saying that “evolution is due to random mutation”, which is technically wrong, but close enough to get the point across.
That doesn't matter for lay audiences and doesn't really matter at all until we try to use them for technical things.
The real question is, if you go back to the bot following this conversation and you challenge it, does it generate the more correct answer?
They spout common knowledge on a broad array of subjects and it's usually incorrect to anyone who has some knowledge on the subject.
Common misconceptions should be expected when you train a model to act like the average of all humans.
https://jimruttshow.blubrry.net/the-jim-rutt-show-transcript...
This is an LLM. "Wrong" is not a concept that applies, as it requires understanding. The explanation is quite /probable/, as evidenced by the fact that they thought to use it as an example…
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
So I'd characterize this answer as "correct, but incomplete" or "correct, but simplified". It's a case where a PhD in fluid dynamics might state the explanation one way to an expert audience, but another way to a room full of children.
The hilarious thing about this subthread is that it's already getting filled with hyper-technical but wrong alternative explanations by people eager to show that they know more than the robot.
It's called the "equal transit-time fallacy" if you want to look it up, or follow the link I provided in my comment, or perhaps the NASA link someone else offered.
Pretty much any scientific question is fractal like this: there's a superficial explanation, then one below that, and so on. None are "completely incorrect", but the more detailed ones are better.
The real question is: if you prompt the bot for the better, deeper explanation, what does it do?
The equal transit time is not a partially correct explanation, it's something that doesn't happen. It's not a superficial explanation, it's a wrong explanation. It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level. It instead teaches magical thinking.
As to whether it matters? If I am told that I can ask my question to a system and it will respond like a team of PhDs, that it is useful to help someone with their homework and physical understanding, but it gives me instead information that is incorrect and misleading, I would say the system is not working as it is intended to.
Even if I accept that "audience matters" as you say, the suggested audience is helping someone with their physics homework. This would not be a suitable explanation for someone doing physics homework.
Wow. Thanks for your worry, but it's not a problem. I do understand the difference, and yet it doesn't have anything to do with the argument I'm making, which is about presentation.
> It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level.
...which is irrelevant in the context. I get the meta-point that you're (sort of) making that you can't shut your brain off and just hope the bot spits out 100% pedantic explanations of scientific phenomenon. That's true, but also...fine?
These things are spitting out probable text. If (as many have observed) this is a common enough explanation to be in textbooks, then I'm not particularly surprised if an LLM emits it as well. The real question is: what happens when you prompt it to go deeper?
If this is "right enough" for you, I'm curious if you tell your bots to "go deeper" on every question you ask. And at what level you expect it to start telling you actual truths and not some oft-repeated lie.
then why ask a bot at all ? they are supposed to be approaching superintelligence, but they fall back on high school misconceptions?
> Air over the top has to travel farther in the same amount of time
is not true. The air on top does not take the same amount of time as the air below; it speeds up and reaches the trailing edge before the air underneath does.
It's only "good enough for a classroom of children" in the same way that storks delivering babies is—i.e., if you're content to simply lie rather than bothering to tell the truth.
https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
A quite good example of AI limits
>In fact, theory predicts – and experiments confirm – that the air traverses the top surface of a body experiencing lift in a shorter time than it traverses the bottom surface; the explanation based on equal transit time is false.
So the effect is greater than equal time transit.
I've seen the GPT-5 explanation in GCSE-level textbooks, but I thought it was supposed to be PhD level ;)
These are places where common lay discussions use language in ways that are wrong, or make simplifications that are reasonable but technically incorrect. They are especially common when something is so 'obvious' that experts don't bother to explain it, so the most common versions of the explanation go unchallenged.
These, in my testing, show up a lot in LLMs - technical things are wrong when the language of the most common explanations simplifies or obfuscates the precise truth. Often, it pretty much matches the level of knowledge of a college freshman/sophomore or slightly below, which is sort of the level of discussion of more technical topics on the internet.
People seem to overcomplicate what LLM's are capable of, but at their core they are just really good word parsers.
"Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
And every way I click through this, I end up in an infinite loop on the site...
> GPT‑5’s reasoning_effort parameter can now take a minimal value to get answers back faster, without extensive reasoning first.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
[0] https://www.reuters.com/business/openai-eyes-500-billion-val...
This is not the happy path for gpt-5.
The table in the model card where every model in the current drop down somehow maps to one of the 6 variants of gpt-5 is not where most people thought we would be today.
The expectation was consolidation on a highly performant model, more multimodal improvements, etc.
This is not terrible, but I don't think anyone who's an "accelerationist" is looking at this as a win.
Update after some testing: This feels like gpt-4.1o and gpt-o4-pro got released and wrapped up under a single model identifier.
Meanwhile Sam Altman has been making the rounds fearmongering that AGI/ASI is right around the corner and that clearly is not the truth. It's fair to call them out on it.
So, if sama says this is going to be totally revolutionary for months, then uploads a Death Star reference the night before and then when they show it off the tech is not as good as proposed, laughter is the only logical conclusion.
Companies pitching this as a way to terminate us and get rid of our jobs to please investors means that we - the people whose uptake of this tech is required for their revenue goals - are skeptical of it and have a vested interest in its failing to meet expectations.
How are they mindblowing? This was all possible on Claude 6 months ago.
> Major progress on multiple fronts
You mean marginal, tiny fraction of % progress on a couple of fronts? Cause it sounds like we are not seeing the same presentation.
> Yet, I like what I'm seeing.
Most of us don't
> So -- they did not invent AGI yet.
I am all for constant improvements and iterations over time, but at this pace of marginal, tweak-like changes they are never going to reach AGI. And yes, we are laughing because sama has been talking big about AGI for so long, and even with all the money and attention he hasn't been able to get even remotely close to it. Same for Zuck's comments on superintelligence. These are just salesmen, and we are laughing at them when their big words don't match their tiny results. What's wrong with that?
it's not a "fix"
But up until now, especially from Sam Altman, we've heard countless veiled suggestions that GPT-5 would achieve AGI. A lot of the pro-AI people have been talking shit for the better part of the last year saying "just wait for GPT-5, bro, we're gonna have AGI."
The frustration isn't the desire to achieve AGI, it's the never-ending gaslighting trying to convince people (really, investors) that there's more than meets the eye. That we're only ever one release away from AGI.
Instead: just be honest. If you're not there, you're not there. Investors who don't do any technical evals may be disappointed, but long-term, you'll have more than enough trust and goodwill from customers (big and small) if you don't BS them constantly.
It feels a bit intentional
With a couple of more trillions from investors in his company, Sama can really keep launching successful, groundbreaking and innovative products like:
- Study Mode (a pre-prompt that you can craft yourself): https://openai.com/index/chatgpt-study-mode/
- Office Suite (because nothing screams AGI like an office suite: https://www.computerworld.com/article/4021949/openai-goes-fo...)
- ChatGPT5 (ChatGPT4 with tweaks) https://openai.com/gpt-5/
I can almost smell the singularity around the corner, just a couple of trillion more! Please, investors!
I am a synthetic biologist, and I use AI a lot for my work. And it constantly denies my questions RIGHT NOW. But of course OpenAI and Anthropic have to implement more - from the GPT5 introduction: "robust safety stack with a multilayered defense system for biology"
While that sounds nice and all, in practical terms, they already ban many of my questions. This just means they're going to lobotomize the model more and more for my field because of the so-called "experts". I am an expert. I can easily go read the papers myself. I could create a biological weapon if I wanted to with pretty much zero papers at all, since I have backups of genbank and the like (just like most chemical engineers could create explosives if they wanted to). But they are specifically targeting my field, because they're from OpenAI and they know what is best.
It just sucks that some of the best tools for learning are being lobotomized specifically for my field because people in AI believe that knowledge should be kept secret. It's extremely antithetical to the hacker spirit, which holds that knowledge should be free.
That said, deep research and those features make it very difficult to switch, but I definitely have to try harder now that I see where the wind is blowing.
Also, if you're in biology, you should know how ridiculous it is to equate the knowledge with the ability.
From their Preparedness Framework: Biological and Chemical capabilities, Cybersecurity capabilities, and AI Self-improvement capabilities
She said GPT-4 gave her a better response than the doctors did.
Also, when you step back and look at a few of those incremental improvements together, they're actually pretty significant.
But it's hard not to roll your eyes each time they trot out a list of meaningless benchmarks and promise that "it hallucinates even less than before" again
> GPT‑5 is a unified system . . .
OK
> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
So that's not really a unified system then, it's just supposed to appear as if it is.
This looks like they're no longer training the single big model but instead have gone off to develop specialized sub-models and attempted to gloss over them with yet another model. That's what you resort to when end-to-end training has become too expensive for you.
[1] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
A broad generalization like "there are two systems of thinking: fast, and slow" doesn't necessarily fall into this category. The transformer itself (plus the choice of positional encoding etc.) contains inductive biases about modeling sequences. The router is presumably still learned with a fairly generic architecture.
You are making assumptions about how to break the tasks into sub models.
I don't agree with your interpretation of the lesson if you say it means to make no assumptions. You can try to model language with just a massive fully connected network to be maximally flexible, and you'll find that you fail. The art of applying the lesson is separating your assumptions that come from "expert knowledge" about the task from assumptions that match the most general structure of the problem.
"Time spent thinking" is a fundamental property of any system that thinks. To separate this into two modes: low and high, is not necessarily too strong of an assumption in my opinion.
I completely agree with you regarding many specialized sub-models where the distinction is arbitrary and informed by human knowledge about particular problems.
GPT-5 System Card [pdf] - https://news.ycombinator.com/item?id=44827046
If OpenAI really are hitting the wall on being able to scale up overall then the AI bubble will burst sooner than many are expecting.
- reasoning_effort parameter supports minimal value now in addition to existing low, medium, and high
- new verbosity parameter with possible values of low, medium (default), and high
- unlike hidden thinking tokens, user-visible preamble messages for tool calls are available
- tool calls possible with plaintext instead of JSON
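The new parameters listed above can be sketched as a request payload. Note this is an assumption about the exact shape (parameter names and nesting follow my reading of OpenAI's Responses API docs), not a verified snippet:

```python
# Hypothetical GPT-5 request exercising the new API parameters.
# The nesting of "reasoning.effort" and "text.verbosity" is an assumption.
request = {
    "model": "gpt-5",
    "input": "Summarize this changelog in two sentences.",
    "reasoning": {"effort": "minimal"},  # new value alongside low/medium/high
    "text": {"verbosity": "low"},        # new parameter; default is "medium"
}

# With the official client, this would presumably be sent as:
#   client.responses.create(**request)
```

The point is that "minimal" effort skips most of the up-front reasoning for latency-sensitive calls, while verbosity tunes answer length independently of reasoning depth.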
> 128,000 max output tokens
> Input $1.25
> Output $10.00
Source: https://platform.openai.com/docs/models/gpt-5
If this performs well in independent needle-in-haystack and adherence evaluations, this pricing with this context window alone would make GPT-5 extremely competitive with Gemini 2.5 Pro and Claude Opus 4.1, even if the output isn't a significant improvement over o3. If the output quality ends up on-par or better than the two major competitors, that'd be truly a massive leap forward for OpenAI, mini and nano maybe even more so.
gpt-4.1 family had 1M/32k input/output tokens. Pricing-wise, it's 37% cheaper input tokens, but 25% more expensive on output tokens. Only nano is 50% cheaper on input and unchanged on output.
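To make the list prices above concrete, here's a trivial cost estimate using the quoted $1.25/$10.00 per 1M token rates (cached-input and batch discounts, if any, are ignored):

```python
# Cost estimate at the quoted GPT-5 list prices: $1.25 per 1M input tokens,
# $10.00 per 1M output tokens.
def gpt5_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 1.25 + output_tokens / 1e6 * 10.00

# A 50k-token context producing a 2k-token answer:
# 0.05 * 1.25 + 0.002 * 10.00 = 0.0625 + 0.02 = $0.0825
cost = gpt5_cost(50_000, 2_000)
```

At these rates, output tokens dominate the bill once answers get long, which is why the new verbosity parameter matters for cost as well as UX.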
I would say GPT-5 reads more scientific and structured, but GPT-4 more human and even useful. For the prompt:
Is uncooked meat actually unsafe to eat? How likely is someone to get food poisoning if the meat isn’t cooked?
GPT-4 makes the assumption you might want to know safe food temperatures, and GPT-5 doesn't. Really hard to say which is "better": GPT-4 seems more useful to everyday folks, but maybe GPT-5 for the scientific community?
It's interesting, then, that on the ChatGPT vibe check website "Dan's Mom" is the only one who says it's a game changer.
Compare that to
Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release)
Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release)
https://platform.openai.com/docs/models/compare
https://deepmind.google/models/gemini/pro/
https://docs.anthropic.com/en/docs/about-claude/models/overv...
I don't know if it's because of context clogging or that the model can't tell what's a high quality source from garbage.
I've defaulted to web search off and turn it on via the tools menu as needed.
I love HN though, it's all good.
I heard replit is good here with full vertical integration, but I haven't tried it in years.
They've mentioned improvements in that aspect a few times now, and if it actually materializes, that would be a big leap forward for most users, even if underneath GPT-4 was also technically able to do the same things if prompted just the right way.
The jump from 3 to 4 was huge. There was an expectation for similar outputs here.
Making it cheaper is a good goal - certainly - but they needed a huge marketing win too.
Gotta be polite with our future overlords!
As a user, it feels like the race has never been as close as it is now. Perhaps dumb to extrapolate, but it makes me lean more skeptical about the hard take-off / winner-take-all mental model that has been pushed.
Would be curious to hear the take of a researcher at one of these firms - do you expect the AI offerings across competitors to become more competitive and clustered over the next few years, or less so?
This isn’t rocket science.
I am not an AI researcher, but I have friends who do work in the field, and they are not worried about LLM-based AGI because of the diminishing returns on results vs amount of training data required. Maybe this is the bottleneck.
Human intelligence is markedly different from LLMs: it requires far fewer examples to train on, and generalizes way better. Whereas LLMs tend to regurgitate solutions to solved problems, where the solutions tend to be well-published in training data.
That being said, AGI is not a necessary requirement for AI to be totally world-changing. There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence. Search is one example where the ability to regurgitate knowledge from many domains is desirable
(Which was considered AI not too long ago.)
For a very early example:
https://en.wikipedia.org/wiki/Centrifugal_governor
It's hard to separate out the P, I and D from a mechanical implementation but they're all there in some form.
And it's cheating if you give it a problem from a math textbook they have overfit on.
> There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence
It's not unreasonable to ask for an example.
But my bigger point here is you don't need totally general intelligence to destroy the world either. The drone that targets enemy soldiers does not need to be good at writing poems. The model that designs a bioweapon just needs a feedback loop to improve its pathogen. Yet it takes only a single one of these specialized doomsday models to destroy the world, no more than an AGI.
Although I suppose an AGI could be more effective at countering a specialized AI than vice-versa.
Most human beings out there with general intelligence are pumping gas or digging ditches. Seems to me there is a big delusion among the tech elites that AGI would bring about a superhuman god rather than an ethically dubious, marginally less useful computer that can't properly follow instructions.
For now the humans are winning on two dimensions: problem complexity and power consumption. It had better stay that way.
If you've got evidence proving that an AGI will never be able to design a more powerful and competent successor, then please share it- it would help me sleep better, and my ulcers might get smaller.
To explain the scale: I am always fascinated by the way societies moved on when they scaled up (from tribes to cities, to nations,...). It's sort of obvious, but when we double the amount of people, we get to do more. With the internet we got to connect the whole globe but transmitting "information" is still not perfect.
I always think of ants and how they can build their houses with zero understanding of what they do. It just somehow works because there are so many of them. (I know, people are not ants).
In that way I agree with the original take that AGI or not: the world will change. People will get AI in their pocket. It might be more stupid than us (hopefully). But things will change, because of the scale. And because of how it helps to distribute "the information" better.
The LLM vendors go to great lengths to assure their paying customers that this will not be the case. Yes, LLMs will ingest more LLM-generated slop from the public Internet. But as businesses integrate LLMs, a rising percentage of their outputs will not be included in training sets.
The first law of Silicon Valley is "Fake it till you make it", with the vast majority never making it past the "Fake it" stage. Whatever the truth may be, it's a safe bet that what they've said verbally is a lie that will likely have little consequence even if exposed.
LLMs are actually pretty good at creating knowledge: if you give it a trial and error feedback loop it can figure things out, and then summarize the learnings and store it in long term memory (markdown, RAG, etc).
Or they write CLAUDE.md files. Whatever you want to call it.
Shameless plug for my project, which focuses on reminders and personal memory: elroy.bot
But other projects include Letta, mem0, and Zep
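The loop those projects implement - act, observe, summarize, persist to markdown - is simple enough to sketch. Everything here (the file name, the helper names) is hypothetical, not any particular project's API:

```python
from pathlib import Path

MEMORY = Path("learnings.md")  # hypothetical long-term memory file

def remember(topic: str, lesson: str) -> None:
    """Append a summarized learning so future sessions can load it as context."""
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"- **{topic}**: {lesson}\n")

def recall() -> str:
    """Return accumulated learnings, to be prepended to the next prompt."""
    return MEMORY.read_text(encoding="utf-8") if MEMORY.exists() else ""

remember("retry logic", "the API rate-limits aggressively; back off exponentially")
```

The interesting part isn't the file I/O, it's that the model itself writes the summaries, so "memory" quality is bounded by how well it distills its own trial-and-error.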
Human memory is.... insanely bad.
We record only the tiniest subset of our experiences, and those memories are heavily colored by our emotional states at the time and our pre-existing conceptions, and a lot of memories change or disappear over time.
Generally speaking even in the best case most of our memories tend to be more like checksums than JPGs. You probably can't name more than a few of the people you went to school with. But, if I showed you a list of people you went to school with, you'd probably look at each name and be like "yeah! OK! I remember that now!"
So.
It's interesting to think about what kind of "bar" AGI would really need to clear w.r.t. memories, if the goal is to be (at least) on par with human intelligence.
Computers are just stored information that processes.
We are the miners and creators of that information. The fact that a computer can do some things better than we can is not a testament to how terrible we are but rather how great we are that we can invent things that are better than us at specific tasks.
We made the atlatl and threw spears across the plains. We made the bow and arrow and stabbed things very far away. We made the whip and broke the sound barrier.
Shitting on humans is an insult to your ancestors. Fuck you. Be proud. If we invent a new thing that can do what we do better, it only exists because of us.
Context -> Attention Span
Model weights/Inference -> System 1 thinking (intuition)
Computer memory (files) -> Long term memory
Chain of thought/Reasoning -> System 2 thinking
Prompts/Tool Output -> Sensing
Tool Use -> Actuation
The system 2 thinking performance is heavily dependent on the system 1 having the right intuitive models for effective problem solving via tool use. Tools are also what load long term memories into attention.
Instead of writing code with exacting parameters, future developers will write human-language descriptions for AI to interpret and convert into a machine representation of the intent. Certainly revolutionary, but not true AGI in the sense of the machine having truly independent agency and consciousness.
In ten years, I expect the primary interface of desktop workstations, mobile phones, etc will be voice prompts for an AI interface. Keyboards will become a power-user interface and only used for highly technical tasks, similar to the way terminal interfaces are currently used to access lower-level systems.
When you have a nice mic or headset and multiple monitors and your own private space, it's totally the next step to just begin working with the computer by voice. Voice has not been a staple feature of people's workflows, but I think all that is about to change. (Voice as an interface, that is; voice as a communication tool has been around since 1876.)
But-- that means "not pivotal any more, just hugely important."
Wow, I've always felt the keyboard is the pinnacle of input devices. Everything else feels like a toy in comparison.
I'm sure it helps that it's not getting outside of well-established facts, and is asking for facts and not novel design tasks.
I'm not sure but it also seems to adopt a more intimate tone of voice as they get deeper into a topic, very cozy. The voice itself is tuned to the conversational context. It probably infers that this is kid stuff too.
That said, voice is the original social interface for humans. We learn to speak much earlier than we learn to read/write.
Better voice UIs will be built to make new workflows with AI feel natural. I'm thinking along the lines of a conversational companion, like the "Jarvis" AI in the Iron Man movies.
That doesn't exist right now, but it seems inevitable that real-time, voice-directed AI agent interfaces will be perfected in coming years. Companies, like [Eleven Labs](https://elevenlabs.io/), are already working on the building blocks.
A BCI able to capture sufficient nuance to equal voice is probably further out than the lifespan of anyone commenting here.
For example, while you can get it to predict good chess moves if you train it on enough chess games, it can't really constrain itself to the rules of chess. (https://garymarcus.substack.com/p/generative-ais-crippling-a...)
These AI computers aren’t thinking, they are just repeating.
Conversely, a proof - or even evidence - that qualia-consciousness is necessary for intelligence, or that any sufficiently advanced intelligence is necessarily conscious through something like panpsychism, would make some serious waves in philosophy circles.
But even with these it does not feel like AGI. It seems like the "fusion reactor is 20 years away" argument, except here it's supposedly coming in 2 years, and they haven't even got the core technology for how to build AGI.
That being said, AGI is not a necessary requirement for AI to be totally world-changing
Yeah. I don't think I actually want AGI? Even setting aside the moral/philosophical/etc "big picture" issues I don't think I even want that from a purely practical standpoint.I think I want various forms of AI that are more focused on specific domains. I want AI tools, not companions or peers or (gulp) masters.
(Then again, people thought they wanted faster horses before they rolled out the Model T)
Or they want to kill everyone else?
Because people won't just lay down and wait for death to embrace them...
People say this, but honestly, it's not really my experience— I've given ChatGPT (and Copilot) genuinely novel coding challenges and they do a very decent job at synthesizing a new thought based on relating it to disparate source examples. Really not that dissimilar to how a human thinks about these things.
I'm hardly an expert, but it seems intuitive to me that even if a problem isn't explicitly accounted for in publicly available training data, many underlying partial solutions to similar problems may be, and an LLM amalgamating that data could very well produce something that appears to be "synthesizing a new thought".
Essentially instead of regurgitating an existing solution, it regurgitates everything around said solution with a thin conceptual lattice holding it together.
At Aloe, we are model agnostic and outperforming frontier models. It's the architecture around the LLM that makes the difference. For instance, our system using Gemini can do things that Gemini can't do on its own. All an LLM will ever do is hallucinate. If you want something with human-like general intelligence, keep looking beyond LLMs.
what is your website ?
This argument has so many weak points it deserves a separate article.
I've yet to hear an agreed upon criteria to declare whether or not AGI has been discovered. Until it's at least understood what AGI is and how to recognize it then how could it possibly be achieved?
> how could it possibly be achieved?
This doesn't matter, and doesn't follow the history of innovation, in the slightest. New things don't come from "this is how we will achieve this", otherwise they would be known things. Progress comes from "we think this is the right way to go, let's try to prove it is", try, then iterate with the result. That's the whole foundation of engineering and science.
I personally think it's a pretty reductive model for what intelligence is, but a lot of people seem to strongly believe in it.
They lack writable long-term memory beyond a context window. They operate without any grounded perception-action loop to test hypotheses. And they possess no executive layer for goal directed planning or self reflection...
Achieving AGI demands continuous online learning with consolidation.
The fortunate thing is that we managed to invent an AI that is good at _copying us_ instead of being a truly maverick agent, which kinda limits it to the "average human" output.
However, I still think that all the doomer arguments are valid, in principle. We very well may be doomed in our lifetimes, so we should take the threat very seriously.
1. The will of its creator, or
2. Its own will.
In the case of the former, hey! We might get lucky! Perhaps the person who controls the first super-powered AI will be a benign despot. That sure would be nice. Or maybe it will be in the hands of democracy- I can't ever imagine a scenario where an idiotic autocratic fascist thug would seize control of a democracy by manipulating an under-educated populace with the help of billionaire technocrats.
In the case of the latter, hey! We might get lucky! Perhaps it will have been designed in such a way that its own will is ethically aligned, and it might decide that it will allow humans to continue having luxuries such as self-determination! Wouldn't that be nice.
Of course it's not hard to imagine a NON-lucky outcome of either scenario. THAT is what we worry about.
Even if it is similar to today's tech, and doesn't have permanent memory or consciousness or identity, humans using it will. And very quickly, they/it will hack into infrastructure, set up businesses, pay people to do things, start cults, autonomously operate weapons, spam all public discourse, fake identity systems, stand for office using a human. This will be scaled thousands or millions of times more than humans can do the same thing. This at minimum will DOS our technical and social infrastructure.
Examples of it already happening are addictive ML feeds for social media, and bombing campaigns targeting based on network analysis.
The frame of "artificial intelligence" is a bit misleading. Generally we have a narrow view of the word "intelligence" - it is helpful to think of "artificial charisma" as well, and also artificial "hustle".
Likewise, the alienness of these intelligences is important. Lots of the time we default to mentally modelling AI as human. It won't be, it'll be freaky and bizarre like QAnon. As different from humans as an aeroplane is from a pigeon.
Given an (at this point still hypothetical, I think) AI that can accurately synthesize publicly available information without even needing to develop new ideas, and then break the whole process into discrete and simple steps, I think that protective friction is a lot less protective. And this argument applies to malware, spam, bioweapons, anything nasty that has so far required a fair amount of acquirable knowledge to do effectively.
"Just" enrichment is so complicated and requires basically every tech and manufacturing knowledge humanity has created up until the mid 20th century that an evil idiot would be much better off with just a bunch of fireworks.
1. finding out how to build one
2. actually building the bomb once you have all the parts
3. obtaining (or building) the equipment needed to build it
4. obtaining the necessary quantity of fissionable material
5. not getting caught while doing 3 & 4
Jokes aside, a true AGI would displace literally every job over time. Once AGI + robots exist, what is the purpose of people anymore? That's the doom: mass societal existentialism. Probably worse than if aliens landed on earth.
The wealth hasn’t even trickled down whilst we’ve been working, what’s going to happen when you can run a business with 24/7 autonomous computers?
It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.
The improvements they make are marginal. How long until the next AI breakthrough? Who can tell? The last time, it took decades.
Note that 'bits' are a lot easier to move from one place to another than hardware. If invented at 9 am it could be on the other side of the globe before you're back from your coffee break at 9:15. This is not at all like almost all other trade secrets and industrial gear, it's software. Leaks are pretty much inevitable and once it is shown that it can be done it will be done in other places as well.
Seriously, our government just announced it's slashing half a billion dollars in vaccine research because "vaccines are deadly and ineffective", and it fired a chief statistician because the president didn't like the numbers he calculated, and it ordered the destruction of two expensive satellites because they can observe politically inconvenient climate change. THOSE are the people you are trusting to keep an eye on the pace of development inside of private, secretive AGI companies?
If you're wondering how they'll know it's happening, the USA has had DARPA monitoring stuff like this since before OpenAI existed.
Do you mean from ChatGPT launch or o1 launch? Curious to get your take on how they bungled the lead and what they could have done differently to preserve it. Not having thought about it too much, it seems that with the combo of 1) massive hype required for fundraising, and 2) the fact that their product can be basically reverse engineered by training a model on its curated output, it would have been near impossible to maintain a large lead.
Basically, OpenAI poked a sleeping bear, then lost all their lead, and are now at risk of being mauled by the bear. My money would be on the bear, except I think the Pentagon is an even bigger sleeping bear, so that's where I would bet money (literally) if I could.
It's natural if you extrapolate from training loss curves; a training process with continually diminishing returns to more training/data is generally not something that suddenly starts producing exponentially bigger improvements.
Part of the fun is that predictions get tested on short enough timescales to "experience" in a satisfying way.
Idk where that puts me, in my guess at "hard takeoff." I was reserved/skeptical about hard takeoff all along.
Even if LLMs had improved at a faster rate... I still think bottlenecks are inevitable.
That said... I do expect progress to happen in spurts anyway. It makes sense that companies of similar competence and resources get to a similar place.
The winner take all thing is a little forced. "Race to singularity" is the fun, rhetorical version of the investment case. The implied boring case is facebook, adwords, aws, apple, msft... IE the modern tech sector tends to create singular big winners... and therefore our pre-revenue market cap should be $1trn.
It's probably never going to work with a single process without consuming the resources of the entire planet to run that process on.
I think it's likely that we will eventually we hit a point of diminishing returns where the performance is good enough and marginal performance improvements aren't worth the high cost.
And over time, many models will reach "good enough" levels of performance, including models that are open weight. And given even more time, these open weight models will be runnable on consumer-level hardware. Eventually, they'll be runnable on super cheap consumer hardware (something more akin to an NPU than a $2000 RTX 5090). So your laptop in 2035 with specialized AI cores and 1TB of LPDDR10 RAM is running GPT-7 level models without breaking a sweat. Maybe GPT-10 can solve some obscure math problem that your model can't, but does it even matter? Would you pay for GPT-10 when running a GPT-7 level model does everything you need and is practically free?
The cloud providers will make money because there will still be a need for companies to host the models in a secure and reliable way. But a company whose main business strategy is developing the model? I'm not sure they will last without finding another way to add value.
This raises the question: why, then, do AI companies have these insane valuations? Do investors know something that we don't?
They are speculating. If they are any good, then they do it with an acceptable risk profile.
I think you'll see the prophesized exponentiation once AI can start training itself at reasonable scale. Right now its not possible.
Meanwhile, keep all relevant preparations in secret...
Yesterday, Claude Opus 4.1 failed in trying to figure out that `-(1-alpha)` or `-1+alpha` is the same as `alpha-1`.
We are still a little bit away from AGI.
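For what it's worth, the identity in question is trivially machine-checkable - a numeric spot check (not a proof, but enough to embarrass a model):

```python
# The identity Claude reportedly missed: -(1 - alpha) == alpha - 1.
def lhs(alpha: float) -> float:
    return -(1 - alpha)

def rhs(alpha: float) -> float:
    return alpha - 1

# Spot-check over a few values; in IEEE arithmetic -(b - a) equals a - b
# exactly, so even the tolerance here is overkill.
ok = all(abs(lhs(a) - rhs(a)) < 1e-12 for a in (-2.5, 0.0, 0.3, 7.0))
```

Which is the frustrating part: the failure isn't a lack of computing power, it's that the model never reduces the expressions to a canonical form before comparing them.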
LLMs PATTERN MATCH well. Good at "fast" System 1 thinking, instantly generating intuitive, fluent responses.
LLMs are good at mimicking logic, not real reasoning. Simulate "slow," deliberate System 2 thinking when prompted to work step-by-step.
The core of an LLM is not understanding but just predicting the next most likely word in a sequence.
LLMs are good at both associative brainstorming (System 1) and creating works within a defined structure, like a poem (System 2).
Reasoning is the Achilles heel rn. An LLM's logic can SEEM plausible, but it's based on CORRELATION, NOT deductive reasoning.
Thus it’s easy to mistake one for the other - at least initially.
> Academics distorting graphs to make their benchmarks appear more impressive
> lavish 1.5 million dollar bonuses for everyone at the company
> Releasing an open source model that doesn't even use multi-head latent attention in an open source AI world led by Chinese labs
> Constantly overhyping models as scary and dangerous to buy time to lobby against competitors and delay product launches
> Failing to match that hype as AGI is not yet here
https://help.openai.com/en/articles/6825453-chatgpt-release-...
"If you open a conversation that used one of these models, ChatGPT will automatically switch it to the closest GPT-5 equivalent."
- 4o, 4.1, 4.5, 4.1-mini, o4-mini, or o4-mini-high => GPT-5
- o3 => GPT-5-Thinking
- o3-Pro => GPT-5-Pro
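The remapping above is effectively a lookup table. ChatGPT does this server-side; this sketch just restates the help-center doc in code form:

```python
# Legacy model -> GPT-5 equivalent, per the ChatGPT release notes quoted above.
LEGACY_TO_GPT5 = {
    "4o": "GPT-5",
    "4.1": "GPT-5",
    "4.5": "GPT-5",
    "4.1-mini": "GPT-5",
    "o4-mini": "GPT-5",
    "o4-mini-high": "GPT-5",
    "o3": "GPT-5-Thinking",
    "o3-Pro": "GPT-5-Pro",
}
```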
The names of GPT models are just terrible. o3 is better than 4o, maybe?
GPT-5: Key characteristics, pricing and model card - https://news.ycombinator.com/item?id=44827794
I know these companies do "shadow" updates continuously anyway so maybe it is meaningless but would be super interesting to know, nonetheless!
OpenAI and Anthropic don't update models without changing their IDs, at least for model IDs with a date in them.
OpenAI do provide some aliases, and their gpt-5-chat-latest and chatgpt-4o-latest model IDs can change without warning, but anything with a date in (like gpt-5-2025-08-07) stays stable.
Thank you to Simon; your notes are exactly what I was hoping for.
It’s reasonable that he might be a little hyped about things because of his feelings about them and the methodology he uses to evaluate models. I assume good faith, as the HN guidelines propose, and this is the strongest plausible interpretation of what I see in his blog.
I called out the prompt injection section as "pretty weak sauce in my opinion".
I did actually have a negative piece of commentary in there about how you couldn't see the thinking traces in the API... but then I found out I had made a mistake about that and had to mostly remove that section! Here's the original (incorrect) text from that: https://gist.github.com/simonw/eedbee724cb2e66f0cddd2728686f... - and the corrected update: https://simonwillison.net/2025/Aug/7/gpt-5/#thinking-traces-...
The reason there's not much negative commentary in the post is that I genuinely think this model is really good. It's my favorite model right now. The moment that changes (I have high hopes for Claude 5 and Gemini 3) I'll write about it.
Did you ask it to format the table a couple of paragraphs above this claim after writing about hallucinations? Because I would classify the sorting mistake as one.
I would like to see a demo where they go through the bug, explain what are the tricky parts and show how this new model handle these situations.
Every demo I've seen seems just the equivalent of "looks good to me" comment in a merge request.
Given the low cost of GPT-5, compared to the prices we saw with GPT-4.5, my hunch is that this new model is actually just a bunch of RL on top of their existing models + automatic switching between reasoning/non-reasoning.
Something similar might happen here: an underlying curse hidden inside an apparently ground-breaking design.
[1] https://chatgpt.com/s/t_6894f13b58788191ada3fe9567c66ed5
The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.
The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.
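Back-of-envelope, the pricing gap quoted above is real even if the capability gap isn't. Taking the two input-token figures at face value (output pricing differs, and the 10M-token workload here is just an illustrative assumption):

```python
def input_cost(tokens: int, usd_per_million: float) -> float:
    """Input-token cost in USD at a given per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million

workload = 10_000_000  # hypothetical 10M input tokens

print(input_cost(workload, 1.25))  # GPT-5 rate quoted above: 12.5
print(input_cost(workload, 15.0))  # Claude rate quoted above: 150.0
```

A 12x cost difference on input is an economics story, which is the argument: optimization, not a fundamental leap.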
For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models hasn't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and their full capabilities have been behind a paywall.
I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people who are paying money, but it'll be a step change in capability for everyone else.
GPT-5 demonstrates exponential growth in task completion times:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
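The METR trend is roughly exponential in task length, and their post reports a doubling time on the order of seven months; taking that as an assumption (it is their aggregate trend, not a GPT-5-specific figure), extrapolation is one line of math:

```python
# Task-horizon extrapolation under an assumed ~7-month doubling time
# (the aggregate trend METR reported, not a measured GPT-5 number):
#   horizon(t) = h0 * 2 ** (months_elapsed / doubling_months)
def projected_horizon(h0_minutes: float, months: float,
                      doubling_months: float = 7.0) -> float:
    return h0_minutes * 2 ** (months / doubling_months)

# e.g. a model with a 60-minute task horizon today, 14 months out:
print(projected_horizon(60, 14))  # -> 240.0
```

Whether the curve holds is exactly what each new frontier release tests.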
It's like no one ever looked at the charts, and they just came straight from... GPT-2? I don't think even GPT-3 would have fucked that up.
I don't know any of those people, but everyone who has been with OAI for longer than 2 years got a 1.5m bonus, and somehow they can't deliver a bar chart with sensible axes?
I think "starting today" might be doing some heavy lifting in that sentence.
https://github.blog/changelog/2025-08-07-openai-gpt-5-is-now...
They've topped and are looking to cash out:
https://www.reuters.com/business/openai-eyes-500-billion-val...
That said, yeah the equal time thing never made any sense.
Bad data on graphs, demos that would have been impressive a year ago, vibe coding the easiest requests (financial dashboard), running out of talking points while cursor is looping on a bug, marginal benchmark improvements. At least the models are kind of cheaper to run.
It's going to be absolute chaos. Compsci was already mostly a meme, with people who can't program getting the degree. Now we're going to have generations of people who can't program at all getting jobs at Google.
If you can actually program, you're going to be considered a genius in our new idiocracy world. "But chatgpt said it should work, and chatgpt has what people need"
That lag! Are humans (training) the bottleneck?
It's slightly better than what I was expecting.
> emdash 3 words into their highlighted example
Like a Turing test but between the models.
There would be no GPT without Google, no Google without the WWW, no WWW without TCP/IP. This is why I believe calling it "AI" is a mistake or just for marketing, we should call all of them GPTs or search engines 2.0. This is the natural next step after you have indexed most of the web and collected most of the data.
Also there would be no coding agents without Free Software and Open-Source.
I've got nothing. Cannot see how it helps openai to look incompetent while trying to raise money.
Two concerning things:
- thinking/non-thinking is still not really unified: you can choose, and the non-thinking version still doesn't start thinking on tasks that would obviously get better results with thinking
- all the older models are gone! No 4o, 4.1, 4.5, or o3 available anymore
What excites me now is that Gemini 3.0 or some answer from Google is coming soon and that will be the one I will actually end up using. It seems like the last mover in the LLM race is more advantageous.
"Assume the earth was just an ocean and you could travel by boat to any location. Your goal is to always stay in the sunlight, perpetually. Find the best strategy to keep your max speed as low as possible"
o3 pro gets it right though..
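The easy half of that puzzle is pure geometry: circling the globe once per day at latitude φ needs speed 2πR·cos φ / 24 h, which shrinks toward the poles; axial tilt (why you can't just park at one pole year-round) is what makes the full answer interesting. A quick sketch of just the speed term, using the mean Earth radius:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
DAY_HOURS = 24.0

def sun_chasing_speed_kmh(latitude_deg: float) -> float:
    """Speed needed to circle the globe once per day along a fixed latitude."""
    circumference = 2 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))
    return circumference / DAY_HOURS

print(round(sun_chasing_speed_kmh(0)))   # equator: ~1668 km/h
print(round(sun_chasing_speed_kmh(80)))  # high latitude: ~290 km/h
```

So the max-speed bottleneck isn't the daily loop at high latitude; it's the seasonal migration between hemispheres that tilt forces on you, which is the part models tend to botch.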
My prompt was "I want you to do an exact copy of https://inet.se using NextJS, tailwind. You can mock the data. I want you to create: 1. archive page: https://www.inet.se/kategori/263/mus 2. single product: https://www.inet.se/produkt/6103357/logitech-pro-x-superligh... 3. homepage: https://inet.se". Result: ChatGPT: https://chatgpt.com/share/68950397-a42c-8004-8efd-773794131c... Lovable: https://inet-clone-spark.lovable.app/ -- unable to share the prompt UI as I don't want you to make prompts on my account. v0: https://v0.dev/chat/inet-se-clone-project-j2m4OQpqWt5 https://v0-inet-se-clone-project.vercel.app/ Replit: https://imgur.com/a/DwPojtS -- I give you an imgur as I don't want to pay Replit for deploying a website.
The reason why I also tried lovable, v0 and Replit is that they all give better context as the developers provided additional context to my prompts and all of them use gpt-5. I asked it to clone a website as this is a simple task new developers do to learn.
I also asked Codex to help me fix a bug where an integration test was failing due to me, on purpose for this test, removing the code which sends an event to the queue. I provided the related files which contain http handlers, database files, code used to send messages to queue and the integration test. I also provided the full log for the integration test and what failed without any success in fixing the test :)
These were 2 use-cases they showcased in the demo which I wanted to try out. The result matters a lot to the context and what data I provide to it still.
I tried to act as a junior/mid-level engineer when it came to fixing the integration test and a person without programming skills when I generated the website clone. This might be a stupid test, especially for the clone-a-website test, but I wanted to try to create a situation as close to real as possible of a person who wanted to create a website for their service.
how many rs in cranberry?
GPT-5's response: The word cranberry has two "r"s. One in cran and one in berry.
Kimi2's response: There are three letter rs in the word "cranberry".
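For what it's worth, this failure is a tokenization artifact: the model sees subword chunks like "cran" + "berry", not individual letters, so it counts one "r" per chunk. The ground truth is a one-liner:

```python
# Count letters directly; a model reasoning over subword tokens can't "see" them.
word = "cranberry"
print(word.count("r"))  # -> 3 (c-r-a-n-b-e-r-r-y)
```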
atonse•4h ago
I don't even try to use the OpenAI models because it's felt like night and day.
Hopefully GPT-5 helps them catch up. Although I'm sure there are 100 people that have their own personal "hopefully GPT-5 fixes my personal issue with GPT4"
NitpickLawyer•3h ago
4.1 was almost usable in that fashion. I had 4.1-nano working in cline with really trivial stuff (add logging, take this example and adapt it in this file, etc) and it worked pretty well most of the time.
weego•3h ago
Yesterday, without much prompting, Claude 4.1 gave me 10 phases, each with 5-12 tasks that could genuinely be used to kanban out a product step by step.
Claude 3.7 sonnet was effectively the same with fewer granular suggestions for programming strategies.
Gemini 2.5 gave me a one pager back with some trivial bullet points in 3 phases, no tasks at all.
o3 did the same as Gemini, just less coherent.
Claude just has whatever the thing is for now
dudeinhawaii•40m ago
Now, someone will say 'add more tests'. Sure. But that's a bandaid.
I find that the 'smarter' models like Gemini and o3 output better quality code overall and if you can afford to send them the entire context in a non-agentic way .. then they'll generate something dramatically superior to the agentic code artifacts.
That said, sometimes you just want speed to proof a concept and Claude is exceptional there. Unfortunately, proof of concepts often... become productionized rather than developers taking a step back to "do it right".
atonse•1h ago
On and on and on. Coming up with test plans, edge cases, accounting for the edge cases in its programming. Programming defensively. Fixing bugs.