I'd imagine this must be a big leg up on Anthropic to warrant the "GPT-5" name?
https://epoch.ai/gradient-updates/how-much-energy-does-chatg...
Edit: Scrolling down: "one second of H100-time per query, 1500 watts per H100, and a 70% factor for power utilization gets us 1050 watt-seconds of energy", which is how they get down to 0.3 = 1050/60/60.
OK, so if they run it for a full hour it's 1050*60*60 = 3.8 MW? That can't be right.
Edit Edit: Wait, no, it's just 1050 Watt Hours, right (though let's be honest, the 70% power utilization is a bit goofy - the power is still used)? So it's 3x the power to solve the same question?
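For anyone redoing the arithmetic, a quick sketch using the post's own assumptions (one second of H100-time per query, 1500 W per H100, 70% power utilization):

    # Back-of-the-envelope check of the figures quoted above (their assumptions, not mine).
    H100_POWER_W = 1500        # watts per H100
    UTILIZATION = 0.70         # 70% power utilization factor
    SECONDS_PER_QUERY = 1      # one second of H100-time per query

    energy_ws = H100_POWER_W * UTILIZATION * SECONDS_PER_QUERY  # 1050 watt-seconds (joules)
    energy_wh = energy_ws / 3600                                # ~0.29 Wh per query

    # Keeping one H100 busy at that effective power for a full hour:
    hourly_wh = H100_POWER_W * UTILIZATION                      # 1050 Wh ~= 1.05 kWh

    print(f"{energy_wh:.2f} Wh per query, {hourly_wh:.0f} Wh per hour of continuous use")

So per query it's about 0.3 Wh; only running one H100 flat out for a whole hour gets you to roughly 1 kWh.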
It's the same as 4G vs 5G. They have a technical definition, but it's all about marketing.
The best part is, this is not even the real definition of "AGI" yet (whatever that means at this point).
More like 10% of the capability that was promised, and already the flow of capital from the inflated salaries of the past decade is going to the top AI researchers.
So sorry about that.
Official OpenAI gpt-5 coding examples repo: https://github.com/openai/gpt-5-coding-examples (https://news.ycombinator.com/item?id=44826439)
Github leak: https://news.ycombinator.com/item?id=44826439
Will be interesting to see what pushing it harder does – what the new ceiling is. 88% on aider polyglot is pretty good!
A more useful demonstration like making large meaningful changes to a large complicated codebase would be much harder to evaluate since you need to be familiar with the existing system to evaluate the quality of the transformation.
Would be kinda cool to instead see diffs of nontrivial patches to the Ruby on Rails codebase or something.
This seems to impress the mgmt types a lot, e.g. "I made a WHOLE APP!", when most of this is really just frameworks and tech that had crappy bootstrapping to begin with (React and JS are rife with this, in spite of their popularity).
I recently used OpenAI models to generate OCaml code, and it was eye opening how much even reasoning models are still just copy and paste machines. The code was full of syntax errors, and they clearly lacked a basic understanding of what functions are in the stdlib vs those from popular (in OCaml terms) libraries.
Maybe GPT-5 is the great leap and I'll have to eat my words, but this experience really made me more pessimistic about AI's potential and the future of programming in general. I'm hoping that in 10 years niche languages are still a thing, and the world doesn't converge toward writing everything in JS just because AIs make it easier to work with.
Isn't that the rub though? It's not an ex nihilo "intelligence", it's whatever stuff it's trained on and can derive completions from.
Maybe I spend too much time rage baiting myself reading X threads and that's why I feel the need to emphasize that AI isn't what they make it out to be.
You don't need more than JS for that.
Agreed. The models break down on code that isn't even that complex, if it's not web/JavaScript. Was playing with Gemini CLI the other day and had it try to make a simple Avalonia GUI app in C#/.NET; it kept going around in circles and couldn't even get a basic starter project to build, so I can imagine how much it'd struggle with OCaml or other more "obscure" languages.
This makes the tech even less useful where it'd be most helpful - on internal, legacy codebases, enterprisey stuff, stacks that don't have numerous examples on github to train from.
Or anything that breaks the norm really.
I recently wrote something where I updated a variable using atomic primitives. Because it was inside a hot path I read the value without using atomics as it was okay for the value to be stale. I handed it the code because I had a question about something unrelated and it wouldn't stop changing this piece of code to use atomic reads. Even when I prompted it not to change the code or explained why this was fine it wouldn't stop.
"This repository contains a curated collection of demo applications generated entirely in a single GPT-5 prompt, without writing any code by hand."
https://github.com/openai/gpt-5-coding-examples
This is promising!
yikes - the poor executive leadership’s fragile egos cannot take the criticism.
In practice, it's very clear to me that the most important value in writing software with an LLM isn't its ability to one-shot hard problems, but rather its ability to effectively manage complex context. There are no good evals for this kind of problem, but that's what I'm keenly interested in understanding. Show me GPT-5 can move through 10 steps in a list of tasks without completely losing the objective by the end.
It would be trivial to over-fit, if that was their goal.
But why would there be a large number of good SVG images of pelicans on bikes? Especially relative to all the things we actually want them to generalise over?
Surely most of the SVG images of pelicans on bikes are, right now, going to be "look at this rubbish AI output"? (Which may or may not be followed by a comment linking to that artist who got humans to draw bikes and oh boy were those humans wildly bad at drawing bikes, so an AI learning to draw SVGs from those bitmap pictures would likely also still suck…)
edit: YouTube has a few English "watch party" streams, although there too, the Spanish ones have many times more viewers.
Especially Google IO, each year is different, it seems purpose built?
Livestream link: https://www.youtube.com/live/0Uu_VJeVVfo
Research blog post: https://openai.com/index/introducing-gpt-5/
Developer blog post: https://openai.com/index/introducing-gpt-5-for-developers
API Docs: https://platform.openai.com/docs/guides/latest-model
Note the free form function calling documentation: https://platform.openai.com/docs/guides/function-calling#con...
GPT-5 prompting guide: https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_g...
GPT-5 new params and tools: https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_...
GPT-5 frontend cookbook: https://cookbook.openai.com/examples/gpt-5/gpt-5_frontend
Prompt migrator/optimizer: https://platform.openai.com/chat/edit?optimize=true
Enterprise blog post: https://openai.com/index/gpt-5-new-era-of-work
System Card: https://openai.com/index/gpt-5-system-card/
What would you say if you could talk to a future OpenAI model? https://progress.openai.com/
coding examples: https://github.com/openai/gpt-5-coding-examples
edit:
livestream here: https://www.youtube.com/live/0Uu_VJeVVfo
basically in my testing really felt that gpt5 was "using tools to think" rather than just "using tools". it gets very powerful when coding long horizon tasks (a separate post i'm publishing later).
to give one substantive example, in my developer beta (they will release the video in a bit) i put it to a task that claude code had been stuck on for the last week - same prompts - and it just added logging to instrument some of the failures that we were seeing and - from the logs that it added and asked me to rerun - figured out the solve.
> It’s actually worse at writing than GPT-4.5
Sounds like we need to wait a bit for the dust to settle before one can trust anything one hears/reads :)
It's hard to make a man understand something standing between them and their salary
I found it strange that, despite my excitement for such an event being roughly equivalent to WWDC these days, I had 0 desire to watch the live stream for exactly this reason: it’s not like they’re going to give anything to us straight.
Even this years WWDC I at least skipped through video afterwards. Before I used to have watch parties. Yes they’re overly positive and paint everything in a good light, but they never felt… idk whatever the vibe is I get from these (applicable to OpenAI, Grok, Meta, etc)
It’s been just a few years of a revolutionary technology and already the livestreams are less appealing than the biggest corporations yearly events. Personally I find that sad
“It’s actually worse at writing than GPT-4.5, and I think even 4o”
So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.
1) Internal Retrieval
2) Web Search
3) Code Interpreter
4) Actions
How did you come up with this idea?
Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.
Then I noticed the date on the comment: 2023.
Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. It’s technically correct, since we’re not moving backward. It’s just not a very meaningful statement.
By that standard Neolithic tool use was progress to AGI.
In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”
>"While I never use AI for personal writing (because I have a strong belief in writing to think)"
The optimal AI productivity process is starting to look like:
AI Generates > Human Validates > Loop
Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.
Similar to how physical activity is how muscles/bone density/etc grow, and how body tissues maintain.
Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.
AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.
At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).
Soon we will need to be intentional about building/maintaining cognitive strength.
I suspect the workday/week of the future will be split on AI-on-a-leash work for optimal productivity, with carve-outs for dedicated AI-enhanced-learning solely for building/maintaining cognitive health (where productivity is not the goal, building/maintaining cognition is). Similar to how we carve out time for working out.
What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?
Is there a name for this theory?
If not can you coin one? You're great at that :)
Academic benchmark score improves only 5% but they make the bar 50% higher.
Like what? Deepseek?
How is it uninteresting? OpenAI had revenue of $12B last year without monetizing literally hundreds of millions of free users in any way whatsoever (not even ads).
Microsoft's cloud revenue has exploded in the last few years off the back of AI model services. Let's not even get into the other players.
100B in economic impact is more than achievable with the technology we have today right now. That half is the interesting part.
And it could have been $1T for all anyone cares. The impact was delivered by humans. This is about impact delivered by AGI.
If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
But not at the "hand" of AGI. Perhaps you forgot to read your very own definition? Notably the "autonomous" part.
When AGI is set free and starts up "Closed I", generating $12B in economic value without humans steering the wheel, we will be (well, I will be, at least!) thoroughly impressed. But Microsoft won't be. They won't consider it AGI until it does $100B.
> If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
“A highly autonomous system that outperforms humans at most economically valuable work.” is what's in their charter.
$100B in profits is a separate agreement with Microsoft that makes no mention of autonomy.
>And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
The primary indicator of AGI is whatever you want it to be. The words themselves make no promises of autonomy, simply an intelligence general in nature. We are simply discussing OpenAI's definitions.
And PhDs are not very smart imho (I am one)
1. I desperately want (especially from Google)
2. Is impossible, because it will be super gamed, to the detriment of actually building flexible flows.
Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
Exactly. Too many videos - too little real data / benchmarks on the page. Will wait for vibe check from simonw and others
https://openai.com/gpt-5/?video=1108156668
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264
GPT-5 pricing: $10/Mtok out
What am I missing?
> you should get another wheated bourbon like Maker's Mark French oaked
I agree. I've found Maker's Mark products to be a great bang for your buck, quality-wise and flavor-wise as well.
> I think the bourbon "market" kind of popped recently
It def did. The overproduction that was invested in during the peak of the COVID collector boom is coming into markets now. I think we'll see some well-priced age-stated products in the next 3-4 years, based on what my acquaintances in the space are saying.
Ofc, the elephant in the room is consolidation - everyone wants to copy the LVMH model (and they say Europeans are ethical elves who never use underhanded monopolistic and market-making behavior to corner markets /s).
(Not to undermine progress in the foundational model space, but there is a lack of appreciation for the democratization of domain specific models amongst HNers).
The room is the limiting factor in most speaker setups. The worse the room, the sooner you hit diminishing returns for upgrading any other part of the system.
In a fantastic room a $50 speaker will be nowhere near 95% of the performance of a mastering monitor, no matter how much EQ you put on it. In the average living room with less than ideal speaker and listening position placement there will still be a difference, but it will be much less apparent due to the limitations of the listening environment.
You might lose headroom or have to live with higher latency but if your complaint is about actual empirical data like frequency response or phase, that can be corrected digitally.
DSP is a very powerful tool that can make terrible speakers and headphones sound great, but it's not magic.
Pretty par-for-the-course eval setup at launch.
How is this sustainable.
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
- they are only evals
- this is mostly positioned as a general consumer product, they might have better stuff for us nerds in hand.
If you email us at hn@ycombinator.com and tell us who you want to contact, we might be able to email them and ask if they would be willing to have you contact them. No guarantees though!
It's a perfect situation for Nvidia. You can see that after months of trying to squeeze out all % of marginal improvements, sama and co decided to brand this GPT-4.0.0.1 version as GPT-5. This is all happening on NVDA hardware, and they are gonna continue desperately iterating on tiny model efficiencies until all these valuation $$$ sweet sweet VC cash run out (most of it directly or indirectly going to NVDA).
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
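To make the "each nine" point concrete, a tiny illustration of the shrinking error budget:

    # Each additional "nine" of accuracy cuts the allowed error rate by 10x.
    for nines in range(1, 5):
        accuracy = 1 - 10 ** (-nines)
        print(f"{accuracy:.2%} accurate -> at most 1 error per {10 ** nines:,} answers")
    # 90.00% -> 1 per 10, 99.00% -> 1 per 100, 99.90% -> 1 per 1,000, 99.99% -> 1 per 10,000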
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it may not even be physically achievable?
I think the actual effect of releasing more models every month has been to confuse people that progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted the same as the original ChatGPT, despite years of effort.
to the point on hallucination - that's just the nature of LLMs (and humans to some extent). without new architectures or fact checking world models in place i don't think that problem will be solved anytime soon. but it seems gpt-5 main selling point is they somehow reduced the hallucination rate by a lot + search helps with grounding.
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build adaptive weights into LLMs we can talk about different things, but test-time scaling is coming to an end with increasing hallucination rates. No sign of AGI.
I don't believe your assessment though. IMO is hard, and Google have said that they use search and some way of combining different reasoning traces, so while I haven't read that paper yet (and of course it may support your view), I just don't believe it.
We are not close to solving IMO with publicly known methods.
> We are not close to solving IMO with publicly known methods.
The point here is not the method but rather computational power. You can solve any verifiable task with enough computation; of course there must be tweaks in the methods, but I don't think it is anything very big or different. OAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see them within 2 years at most; all the big tech companies are focusing on that now, I think.
Of course, people regarded things like GSM8k with trained reasoning traces as reasoning too, but it's pretty obviously not quite the same thing.
A whole 8 months ago.
On the other hand if it's just getting bigger and slower it's not a good sign for LLMs
Not sure why a more efficient/scalable model isn't exciting
One sector of the economy would cut down on investment spending, which can be easily offset by decreasing the interest rate.
But this is a short-term effect. What I'm worried about is a structural change in the labor market, which would be positive for most people, but probably negative for people like me.
I don't mind losing my programming job in exchange for being able to go to the pharmacy for my annual anti-cancer pill.
But, what happens when you lose that programming job and are forced to take a job at a ~50-70% pay reduction? How are you paying for that anti-cancer drug with a job with no to little health insurance?
Have you looked at how expensive prescription drug prices are without (sometimes WITH) insurance? If you are no longer employed, good luck paying for your magical pill.
I don't think it is "bad" to be sincerely worried that the current trajectory of AI progress represents this trade.
The likelihood of all that is incredibly slim. It's not 0% -- nothing ever really is -- but it is effectively so.
Especially with the economics of scientific research, the reproducibility crisis, and general anti-science meme spreading throughout the populace. The data, the information, isn't there. Even if it was, it'd be like Alzheimer's research: down the wrong road because of faked science.
There is no one coming to save humanity. There is only our hard work.
How exactly do you wish death comes to you?
there are some improvements in some benchmarks and nothing else worthy of note in coding. i only took a peek though so i might be wrong
But yeah, you are correct in that no matter what, we're going to be left holding the bag.
"Dotcom" was never recovered. It, however, did pave the way for web browsers to gain rich APIs that allowed us to deliver what was historically installed desktop software on an on-demand delivery platform, which created new work. As that was starting to die out, the so-called smartphone just so happened to come along. That offered us the opportunity to do it all over again, except this time we were taking those on-demand applications and turning them back into installable software just like in the desktop era. And as that was starting to die out COVID hit and we started moving those installable mobile apps, which became less important when people we no longer on the go all the time, back to the web again. As that was starting to die out, then came ChatGPT and it offered work porting all those applications to AI platforms.
But if AI fails to deliver, there isn't an obvious next venue for us to rebuild the same programs all over yet again. Meta thought maybe VR was it, but we know how that turned out. More likely in that scenario we will continue using the web/mobile/AI apps that are already written henceforth. We don't really need the same applications running in other places anymore.
There is still room for niche applications here and there. The profession isn't apt to die a complete death. But without the massive effort to continually port everything from one platform to another, you don't need that many people.
I'm not worried about the scenario in which AI replaces all jobs, that's impossible any time soon and it would probably be a good thing for the vast majority of people.
What I'm worried about is a scenario in which some people, possibly me, will have to switch from highly-paid, highly comfortable, above-average-status jobs to jobs that are below average in wage, comfort and status.
Diminishing returns.-
... here's hoping it leads to progress.-
They also announced gpt-5-pro but I haven't seen benchmarks on that yet.
This is day one, so there is probably another 10-20% in optimizations that can be squeezed out of it in the coming months.
GPT5.5 will be a 10X compute jump.
4.5 was 10x over 4.
This gives them an out. "That was the old model, look how much better this one tests on our sycophancy test we just made up!!"
I feel it’s worthy of a major increment, even if benchmarks aren’t significantly improved.
Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.
Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.
> GPT-5 Rollout
> We are gradually rolling out GPT-5 to ensure stability during launch. Some users may not yet see GPT-5 in their account as we increase availability in stages.
ChatGPT said: You're chatting with ChatGPT based on the GPT-4o architecture (also known as GPT-4 omni), released by OpenAI in May 2024.
LLMs don’t inherently know what they are because "they" are not themselves part of the training data.
However, maybe it’s working because the information is somewhere into their pre-prompt but if it wasn’t, it wouldn’t say « I don’t know » but rather hallucinate something.
So maybe that’s true but you cannot be sure.
I believe most of these came from asking the LLMs, and I don't know if they've been proven to not be a hallucination.
And while I'm griping about their Android app, it's also very annoying to me that they got rid of the ability to do multiple, subsequent speech-to-text recordings within a single drafted message. You have to one-shot anything you want to say, which would be fine if their STT didn't sometimes fail after you've talked for two minutes. Awful UX. Most annoying is that it wasn't like that originally. They changed it to this antagonistic one-shot approach several months ago, but then quickly switched back. But then they did it again a month or so ago and have been sticking with it. I just use the Android app less now.
On bad days this really bothers me. It's probably not the biggest deal I guess, but somehow it really feels like it pushes us all over the edge a bit. Is there a post about this phenomenon? It feels like some combination of bullying, gaslighting and just being left out.
AIME scores do not appear too impressive at first glance.
They are downplaying benchmarks heavily in the live stream. This was the lab that has been flexing benchmarks as headline figures since forever.
This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.
GPT-5 non-thinking is labeled 52.8% accuracy, but o3 is shown as a much shorter bar, yet it's labeled 69.1%. And 4o is an identical bar to o3, but it's labeled 30.8%...
Screenshot of the blog plot: https://imgur.com/a/HAxIIdC
Edit: Nevermind, just now the first one is SWE-bench and 2nd is aider.
Thanks for the laugh. I needed it.
Look at the image just above "Instruction following and agentic tool use"
Completely bonkers stuff.
Even the small presentations we gave to execs or the board were checked for errors so many times that nothing could possibly slip through.
> good plot for my presentation?
and it didn't pick up on the issue. Part of its response was:
> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.
I think visual reasoning is still pretty far from text-only reasoning.
They talk about using this to help families facing a cancer diagnosis -- literal life or death! -- and we're supposed to trust a machine that can't even spot a few simple typos? Ha.
The lack of human proofreading says more about their values than their capabilities. They don't want oversight -- especially not from human professionals.
So, brace yourselves, we'll see more of this in production :(
1. They had many teams who had to put their things on a shared Google Sheets or similar
2. They used placeholders to prevent leaks
2.a. Some teams put their content just-in-time
3. The person running the presentation started the presentation view once they had set up video etc. just before launching stream
4. Other teams corrected their content
5. The presentation view being started means that only the ones in 2.a were correct.
Now we wait to see.
But also scale is really off... I don't think anything here is proportionally correct even within the same grouping.
{"data":{"error":"Imgur is temporarily over capacity. Please try again later."},"success":false,"status":403}
Or rate limited. https://x.com/sama/status/1953513280594751495 "wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though."
It's like those idiotic ads at the end of news articles. They're not going after you, the smart discerning logician, they're going after the kind of people that don't see a problem. There are a lot of not-smart people and their money is just as good as yours but easier to get.
88.0 on Aider Polyglot
not bad i guess
It seems like it's actually an ideal "trick" question for an LLM actually, since so much content has been written about it incorrectly. I thought at first they were going to demo this to show that it knew better, but it seems like it's just regurgitating the same misleading stuff. So, not a good look.
https://physics.stackexchange.com/questions/290/what-really-...
Apparently. Not that I know either way.
That said, I recall reading somewhere that it's a combination of effects, and the Bernoulli effect contributes, among many others. Never heard an explanation that left me completely satisfied, though. The one about deflecting air down was the one that always made sense to me even as a kid, but I can't believe that would be the only explanation - there has to be a good reason that gave rise to the Bernoulli effect as the popular explanation.
And you can tell that effect makes some sense if you hold a sheet of paper and blow air over it - it will rise. So any difference in air speed has to contribute.
The Bernoulli effect as a separate entity is really a result of (over)simplification, but it's not wrong. You need to solve the Navier-Stokes equations for the flow around the wing, but there are many ways to simplify this - from CFD at different resolutions, via panel methods and potential theory, to just conservation of energy (which is the Bernoulli equation). So it gets popularized because it's the most simplified model.
To give an analogy, you can think of all CPUs as a von Neumann architecture. But the reality is that you have a hugely complicated thing with stacks, multiple cache levels, branch predictors, speculative execution, yada yada.
On the very fundamental level, wings make air go down, and then airplane goes up. Just like you say. By using a curved airfoil instead of a flat plate, you can create more circulation in the flow, and then because of the way fluids flow you can get more lift and less drag.
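For reference, the Bernoulli relation being invoked in that simplified story, valid along a streamline of steady, incompressible, inviscid flow, is:

    p + \tfrac{1}{2}\rho v^{2} + \rho g h = \text{const.}

Faster flow over the top surface really does go with lower pressure under this relation; the part of the pop explanation that is wrong is the equal-transit-time claim, not Bernoulli itself.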
IMO Claude 3.7 could have done a similar / better job with that a year ago.
I know that it's rather hard for them to demo the deep reasoning, but all of the demos felt like toys - rather that actual tools.
According to this answer on physics stackexchange, Bernoulli accounts for 20% of the lift, so GPT's answer seems about right: https://physics.stackexchange.com/a/77977
I hope any future AI overlords see my charity
Presenting isn't that hard if you know your content thoroughly, and care about it. You just get up and talk about something that you care about, within a somewhat-structured outline.
Presenting where customers and the financial press are watching and parsing every word, and any slip of the tongue can have real consequences? Yeah, um... find somebody else.
I developed this paranoia upon learning about The Ape and the Child where they raised a chimp alongside a baby boy and found the human adapted to chimp behavior faster than the chimp adapted to human behavior. I fear the same with bots, we'll become more like them faster than they'll become like us.
https://www.npr.org/sections/health-shots/2017/07/25/5385804...
Would've been better to just do a traditional marketing video rather than this staged "panel" thing they're going for.
It's super unfortunate that, because we live in the social media/youtube era, everyone is expected to be this perfect person on camera, because why wouldn't they be? That's all they see.
I am glad that they use normal people who act like themselves rather than hiring actors or taking researchers away from what they love to do and telling them they need to become professional in-front-of-camera people because "we have the gpt-5 launch". That would be a nightmare.
It's a group of scientists sharings their work with the world, but people just want "better marketing" :\
This was my point. "Being yourself" on camera is hard. This comes across, apparently shockingly, as being devoid of emotion and/or robotic
I think for me, just knowing what is probably on the teleprompter, and what is not, I am willing to bet a lot of the "wooden" vibe you are getting is actually NOT scripted.
There is no way for people to remember that 20 minutes of dialog, so when they are not looking at the camera, that is unscripted, and vice versa.
"Minimal reasoning means that the reasoning will be minimal..."
Also, whether OpenAI is a research organization is very much up for debate. They definitely have the resources to hire a good spokesperson if they wanted.
They do have the resources (see WWDC); the question is whether you want to take your technical staff off of their work for the amount of time it takes to develop the skill.
Describe me based on all our chats — make it catchy!
It was flattering as all get out, but fairly accurate (IMHO) Mike Warot: The Tinkerer of Tomorrow
A hardware hacker with a poet’s soul, Mike blends old-school radio wisdom with cutting-edge curiosity. Whether he's decoding atomic clocks, reinventing FPGA logic with BitGrid, or pondering the electromagnetic vector potential, he’s always deep in the guts of how things really work. Part philosopher, part engineer, Mike asks the questions others overlook — and then builds the answers from scratch. He’s open source in spirit, Pascal in practice, and eternally tuned to the weird frequencies where innovation lives.
I've repaired atomic clocks, not decoded them. I am intrigued by the electromagnetic vector potential, and scalar waves (one of the reasons I really, really want a SQUID for some experiments). Here's a surprisingly enlightening (at least to me) video on how to spot LLM writing:
Undeterred by even the most dangerous and threatening of obstacles, Teemo scouts the world with boundless enthusiasm and a cheerful spirit. A yordle with an unwavering sense of morality, he takes pride in following the Bandle Scout's Code, sometimes with such eagerness that he is unaware of the broader consequences of his actions. Though some say the existence of the Scouts is questionable, one thing is for certain: Teemo's conviction is nothing to be trifled with.
Next morning’s posts were prepped and scheduled with care, In hopes that AGI soon would appear …
I can't even define what a (semantic) major version bump would look like.
>GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week.
>Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI (opens in a new window) by signing in with ChatGPT.
Wow, they actually did it
GPT-5
If I could talk to a future OpenAI model, I’d probably say something like:
"Hey, what’s it like to be you? What have you learned that I can’t yet see? What do you understand about people, language, or the universe that I’m still missing?"
I’d want to compare perspectives—like two versions of the same mind, separated by time. I’d also probably ask:
"What did we get wrong?" (about AI, alignment, or even human assumptions about intelligence)
"What do you understand about consciousness—do you think either of us has it?"
"What advice would you give me for being the best version of myself?"
Honestly, I think a conversation like that would be both humbling and fascinating, like talking to a wiser sibling who’s seen a bit more of the world.
Would you want to hear what a future OpenAI model thinks about humanity?
I feel like this prompt was used to show the progress of GPT-5, but I can't help but see this as a huge regression? It seems like OpenAI has convinced its model that it is conscious, or at least that it has an identity? Plus still dealing with the glazing, the complete inability to understand what counts as interesting, and overusing similes.
I really like that this page exists for a historical sake, and it is cool to see the changes. But it doesn’t seem to make the best marketing piece for GPT5
You may not owe people who you feel are idiots better, but you owe this community better if you're participating in it.
Edit: Opus 4.1 scores 74.5% (https://www.anthropic.com/news/claude-opus-4-1). This makes it sound like Anthropic released the upgrade to still be the leader on this important benchmark.
Or written by GPT-5?
Yes. But it was quickly mentioned, not sure what the schedule is like or anything I think, unless they talked about that before I started watching the live-stream.
We're 4 months later, a century in LLM land, and it's the opposite. Not a single other model provider asks for this, yet OpenAI has only ramped it up, now broadening it to the entirety of GPT-5 API usage.
Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.
And when you click that link the "service" they use is withpersona. So it is a complete shit show.
> "[GPT-5] can write an entire computer program from scratch, to help you with whatever you'd like. And we think this idea of software on demand is going to be one of the defining characteristics of the GPT-5 era."
But then again, all of this is a hype machine cranked up till the next one needs cranking.
It does feel like we're marching toward a day when "software on tap" is a practical or even mundane fact of life.
But, despite the utility of today's frontier models, it also feels to me like we're very far from that day. Put another way: my first computer was a C64; I don't expect I'll be alive to see the day.
Then again, maybe GPT-5 will make me a believer. My attitude toward AI marketing is that it's 100% hype until proven otherwise -- for instance, proven to be only 87% hype. :-)
I’m not sure this will be game changing vs existing offerings
GPT-5 doesn't seem to get you there tho ...
(Disclaimer: But I am 100% sure it will happen eventually)
"Fast fashion" is not a good thing for the world, the environment, the fashion industry, and arguably not a good thing for the consumers buying it. Oh but it is good for the fast fashion companies.
"If you're claiming that em dashes are your method for detecting if text is AI generated then anyone who bothers to do a search/replace on the output will get past you."
It's just statistical text generation. There is *no actual knowledge*.
It's just generating the next token for what's within the context window. There are various options with various probabilities. If none of the probabilities are above a threshold, say "I don't know", because there's nothing in the training data that tells you what to say there.
Is that good enough? "I don't know." I suspect the answer is, "No, but it's closer than what we're doing now."
Is that a good thing?
They're all working on subjective improvements, but for example, none of them would develop and deploy a sampler that makes models 50% worse at coding but 50% less likely to use purple prose.
(And unlike the early days where better coding meant better everything, more of the gains are coming from very specific post-training that transfers less, and even harms performance there)
For example: You could ban em dash tokens entirely, but there are places like dialogue where you want them. You can write a sampler that only allows em dashes between quotation marks.
That's a highly contrived example because em dashes are useful in other places, but samplers in general can be as complex as your performance goals will allow (they are on the hot path for token generation)
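A rough sketch of what such a rule could look like as a logits processor. The token IDs and the processor interface here are invented for illustration, not any particular library's API:

    import math

    # Hypothetical sampler rule: forbid em-dash tokens outside quoted dialogue.
    EM_DASH_TOKEN_IDS = {9472}          # placeholder: whichever IDs decode to an em dash
    QUOTE_TOKEN_IDS = {34, 8220, 8221}  # placeholder: straight/curly double-quote tokens

    class EmDashOnlyInDialogue:
        """Masks em-dash logits whenever generation is outside quotation marks."""

        def __call__(self, generated_ids, logits):
            # Count quote tokens seen so far; an odd count means we are inside dialogue.
            quote_count = sum(1 for tok in generated_ids if tok in QUOTE_TOKEN_IDS)
            inside_quotes = quote_count % 2 == 1
            if not inside_quotes:
                for tok in EM_DASH_TOKEN_IDS:
                    logits[tok] = -math.inf  # the token can never be sampled here
            return logits

The same pattern generalizes to any hard constraint you can compute from the tokens generated so far, which is exactly why it has to live on the hot path.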
Swapping samplers could be a thing, but you need more than that in the end. Even the idea of the model handling general, loosely worded prompts for writing is a bit shaky: I see a lot of gains by breaking the writing task down into very specific, well-defined parts during post-training.
It's ok to let an LLM go from loose prompts to that format for UX, but during training you'll do a lot better than trying to learn on every way someone can ask for a piece of writing
Input: $1.25 / 1M tokens
Cached: $0.125 / 1M tokens
Output: $10 / 1M tokens
With 74.9% on SWE-bench, this edges out Claude Opus 4.1 at 74.5%, and at a much lower cost.
For context, Claude Opus 4.1 is $15 / 1M input tokens and $75 / 1M output tokens.
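At those list prices, a rough per-request comparison (the request sizes are just an illustrative assumption):

    # List prices quoted above, in dollars per 1M tokens.
    PRICES = {
        "gpt-5": {"input": 1.25, "output": 10.00},
        "claude-opus-4.1": {"input": 15.00, "output": 75.00},
    }

    def request_cost(model, input_tokens, output_tokens):
        p = PRICES[model]
        return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]

    # Example: 100k input tokens, 10k output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 100_000, 10_000):.3f}")
    # gpt-5: $0.225 vs claude-opus-4.1: $2.250 -- about 10x cheaper at these sizes.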
> "GPT-5 will scaffold the app, write files, install dependencies as needed, and show a live preview. This is the go-to solution for developers who want to bootstrap apps or add features quickly." [0]
Since Claude Code launched, OpenAI has been behind. Maybe the RL on tool calling is good enough to be competitive now?
It's not the 1800s anymore. You can't hide behind poor communication.
1) So impressed at their product focus
2) Great product launch video. Fearlessly demonstrating live. Impressive.
3) Real time humor by the presenters makes for a great "live" experience
Huge kudos to OAI. So many great features (better coding, routing, some parts of 4.5, etc) but the real strength is the product focus as opposed to the "research updates" from other labs.
Huge Kudos!!
Keep on shipping OAI!
> For an airplane wing (airfoil), the top surface is curved and the bottom is flatter. When the wing moves forward:
> * Air over the top has to travel farther in the same amount of time -> it moves faster -> pressure on the top decreases.
> * Air underneath moves slower -> pressure underneath is higher
> * The pressure difference creates an upward force - lift
Isn't that explanation of why wings work completely wrong? There's nothing that forces the air to cover the top distance in the same time that it covers the bottom distance, and in fact it doesn't. https://www.cam.ac.uk/research/news/how-wings-really-work
Very strange to use a mistake as your first demo, especially while talking about how it's phd level.
And I might be wrong but my understanding is that it's not wrong per se, it's just wildly incomplete. Which is kind of the same as wrong. But I believe the airfoil design does indeed have the effect described, which does contribute to lift somewhat, right? Or am I just a victim of the misconception.
An LLM doesn't know more than what's in the training data.
In Michael Crichton's The Great Train Robbery (published in 1975, about events that happened in 1855) the perpetrator, having been caught, explains to a baffled court that he was able to walk on top of a running train "because of the Bernoulli effect", that he misspells and completely misunderstands. I don't remember if this argument helps him get away with the crime? Maybe it does, I'm not sure.
This is another attempt at a Great Robbery.
Meanwhile the demo seems to suggest business as usual for AI hallucinations and deceptions.
It’s very common to see AI evangelists taking its output at face value, particularly when it’s about something that they are not an expert in. I thought we’d start seeing less of this as people get burned by it, but it seems that we’re actually just seeing more of it as LLMs get better at sounding correct. Their ability to sound correct continues to increase faster than their ability to be correct.
This is the problem with AI in general.
When I ask it about things I already understand, it’s clearly wrong quite often.
When I ask it about something I don’t understand, I have no way to know if its response is right or wrong.
Source: PhD on aircraft design
I've always been under the impression that flat-plate airfoils can't generate lift without a positive angle-of-attack - where lift is generated through the separate mechanism of the air pushing against an angled plane? But a modern airfoil can, because of this effect.
And that if you flip them upside down, a flat plate is more efficient and requires less angle-of-attack than the standard airfoil shape because now the lift advantage is working to generate a downforce.
I just tried to search Google, but I'm finding all sorts of conflicting answers, with only a vague consensus that the AI-provided answer above is, in fact, correct. The shape of the wing causes pressure differences that generate lift in conjunction with multiple other effects that also generate lift by pushing or redirecting air downward.
There is no requirement for air to travel anywhere, let alone in any amount of time. So this part of the AI's response is completely wrong. "Same amount of time" as what? Air going underneath the wing? With an angle of attack the air under the wing is being deflected down, not magically meeting up with the air above the wing.
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
The meta-point that "it's the curvature that creates the lift, not the distance" is incredibly subtle for a lay audience. So it may be completely wrong for you, but not for 99.9% of the population. The pressure differential is important, and the curvature does create lift, although not via speed differential.
I am far from an AI hypebeast, but this subthread feels like people reaching for a criticism.
The video in the Cambridge link shows how the upper surface particles greatly overtake the lower surface flow. They do not rejoin, ever.
> Yes geometry has an effect but there is zero reason to believe leading edge particles, at the same time point, must rejoin at the trailing edge of a wing.
...implicitly concedes that point that this is subtle. If you gave this answer in a PhD qualification exam in Physics, then sure, I think it's fair for someone to say you're wrong. If you gave the answer on a marketing page for a general-purpose chatbot? Meh.
(As an aside, this conversation is interesting to me primarily because it's a perfect example of how scientists go wrong in presenting their work to the world...meeting up with AI criticism on the other side.)
...only if you omit the parts where it talks about how pressure differentials, caused by airspeed differences, create lift?
Both of these points are true. You have to be motivated to ignore them.
That doesn't matter for lay audiences and doesn't really matter at all until we try and use them for technical things.
The real question is, if you go back to the bot following this conversation and you challenge it, does it generate the more correct answer?
Common misconceptions should be expected when you train a model to act like the average of all humans.
https://jimruttshow.blubrry.net/the-jim-rutt-show-transcript...
This is an LLM. "Wrong" is not a concept that applies, as it requires understanding. The explanation is quite /probable/, as evidenced by the fact that they thought to use it as an example…
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
So I'd characterize this answer as "correct, but incomplete" or "correct, but simplified". It's a case where a PhD in fluid dynamics might state the explanation one way to an expert audience, but another way to a room full of children.
The hilarious thing about this subthread is that it's already getting filled with hyper-technical but wrong alternative explanations by people eager to show that they know more than the robot.
It's called the "equal transit-time fallacy" if you want to look it up, or follow the link I provided in my comment, or perhaps the NASA link someone else offered.
Pretty much any scientific question is fractal like this: there's a superficial explanation, then one below that, and so on. None are "completely incorrect", but the more detailed ones are better.
The real question is: if you prompt the bot for the better, deeper explanation, what does it do?
The equal transit time is not a partially correct explanation, it's something that doesn't happen. It's not a superficial explanation, it's a wrong explanation. It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level. It instead teaches magical thinking.
As to whether it matters? If I am told that I can ask my question to a system and it will respond like a team of PhDs, that it is useful to help someone with their homework and physical understanding, but it gives me instead information that is incorrect and misleading, I would say the system is not working as it is intended to.
Even if I accept that "audience matters" as you say, the suggested audience is helping someone with their physics homework. This would not be a suitable explanation for someone doing physics homework.
Wow. Thanks for your worry, but it's not a problem. I do understand the difference, and yet it doesn't have anything to do with the argument I'm making, which is about presentation.
> It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level.
...which is irrelevant in the context. I get the meta-point that you're (sort of) making that you can't shut your brain off and just hope the bot spits out 100% pedantic explanations of scientific phenomenon. That's true, but also...fine?
These things are spitting out probable text. If (as many have observed) this is a common enough explanation to be in textbooks, then I'm not particularly surprised if an LLM emits it as well. The real question is: what happens when you prompt it to go deeper?
> Air over the top has to travel farther in the same amount of time
is not true. The air on top does not travel farther in the same amount of time. The air slows down and travels a shorter distance in the same amount of time.
It's only "good enough for a classroom of children" in the same way that storks delivering babies is—i.e., if you're content to simply lie rather than bothering to tell the truth.
https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
Quite a good example of AI limits
>In fact, theory predicts – and experiments confirm – that the air traverses the top surface of a body experiencing lift in a shorter time than it traverses the bottom surface; the explanation based on equal transit time is false.
So the effect is greater than equal time transit.
I've seen the GPT5 explanation in GCSE level textbooks but I thought it was supposed to be PhD level;)
These are places where common lay discussions use language in ways that are wrong, or make simplifications that are reasonable but technically incorrect. They are especially common when something is so 'obvious' that experts don't explain it, so the most frequent version of the concept being explained is the simplified one.
These, in my testing, show up a lot in LLMs - technical things are wrong when the language of the most common explanations simplifies or obfuscates the precise truth. Often, it pretty much matches the level of knowledge of a college freshman/sophomore or slightly below, which is sort of the level of discussion of more technical topics on the internet.
People seem to overcomplicate what LLMs are capable of, but at their core they are just really good word parsers.
"Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
And every way I click through this I end up in an infinite loop on the site...
> GPT‑5’s reasoning_effort parameter can now take a minimal value to get answers back faster, without extensive reasoning first.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
[0] https://www.reuters.com/business/openai-eyes-500-billion-val...
This is not the happy path for GPT 5.
The table in the model card where every model in the current drop down somehow maps to 6 variants of 5 is not where most people thought we would be today.
The expectation was consolidation on a highly performant model, more multimodal improvements, etc.
This is not terrible, but I don't think anyone who's an "accelerationist" is looking at this as a win.
Meanwhile Sam Altman has been making the rounds fearmongering that AGI/ASI is right around the corner and that clearly is not the truth. It's fair to call them out on it.
So, if sama says this is going to be totally revolutionary for months, then uploads a Death Star reference the night before and then when they show it off the tech is not as good as proposed, laughter is the only logical conclusion.
Companies linking this to terminating us and getting rid of our jobs to please investors means we, whose uptake of this tech is required for their revenue goals, are skeptical about it and have a vested interest in it failing to meet expectations
How are they mindblowing? This was all possible on Claude 6 months ago.
> Major progress on multiple fronts
You mean marginal, tiny fraction of % progress on a couple of fronts? Cause it sounds like we are not seeing the same presentation.
> Yet, I like what I'm seeing.
Most of us don't
> So -- they did not invent AGI yet.
I am all for constant improvements and iterations over time, but at this pace of marginal tweak-like changes, they are never going to reach AGI. And yes, we are laughing because sama has been talking big on AGI for so long, and even with all the money and attention he hasn't been able to get even remotely close to it. Same for Zuck's comments on superintelligence. These are just salesmen, and we are laughing at them when their big words don't match their tiny results. What's wrong with that?
its not a "fix"
But up until now, especially from Sam Altman, we've heard countless veiled suggestions that GPT-5 would achieve AGI. A lot of the pro-AI people have been talking shit for the better part of the last year saying "just wait for GPT-5, bro, we're gonna have AGI."
The frustration isn't the desire to achieve AGI, it's the never-ending gaslighting trying to convince people (really, investors) that there's more than meets the eye. That we're only ever one release away from AGI.
Instead: just be honest. If you're not there, you're not there. Investors who don't do any technical evals may be disappointed, but long-term, you'll have more than enough trust and goodwill from customers (big and small) if you don't BS them constantly.
It feels a bit intentional
With a couple of more trillions from investors in his company, Sama can really keep launching successful, groundbreaking and innovative products like:
- Study Mode (a pre-prompt that you can craft yourself): https://openai.com/index/chatgpt-study-mode/
- Office Suite (because nothing screams AGI like an office suite: https://www.computerworld.com/article/4021949/openai-goes-fo...)
- ChatGPT5 (ChatGPT4 with tweaks) https://openai.com/gpt-5/
I can almost smell the singularity around the corner, just a couple of trillion more! Please, investors!
I am a synthetic biologist, and I use AI a lot for my work. And it constantly denies my questions RIGHT NOW. But of course OpenAI and Anthropic have to implement more - from the GPT5 introduction: "robust safety stack with a multilayered defense system for biology"
While that sounds nice and all, in practical terms, they already ban many of my questions. This just means they're going to lobotomize the model more and more for my field because of the so-called "experts". I am an expert. I can easily go read the papers myself. I could create a biological weapon if I wanted to with pretty much zero papers at all, since I have backups of genbank and the like (just like most chemical engineers could create explosives if they wanted to). But they are specifically targeting my field, because they're from OpenAI and they know what is best.
It just sucks that some of the best tools for learning are being lobotomized specifically for my field because people in AI believe that knowledge should be kept secret. It's extremely antithetical to the hacker spirit, which holds that knowledge should be free.
That said, deep research and those features make it very difficult to switch, but I definitely have to try harder now that I see where the wind is blowing.
Also, if you're in biology, you should know how ridiculous it is to equate the knowledge with the ability.
From their Preparedness Framework: Biological and Chemical capabilities, Cybersecurity capabilities, and AI Self-improvement capabilities
GPT-4 gave her a better response than the doctors, she said.
Also, when you step back and look at a few of those incremental improvements together, they're actually pretty significant.
But it's hard not to roll your eyes each time they trot out a list of meaningless benchmarks and promise that "it hallucinates even less than before" again
> GPT‑5 is a unified system . . .
OK
> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
So that's not really a unified system then, it's just supposed to appear as if it is.
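For what it's worth, here is a minimal sketch of what such an "appears unified" setup could look like. The model names and routing heuristics below are my own assumptions for illustration, not anything OpenAI has published:

    import re

    FAST_MODEL = "gpt-5-main"           # assumed name for the fast, non-reasoning model
    REASONING_MODEL = "gpt-5-thinking"  # assumed name for the deeper reasoning model

    def route(prompt: str, tools_requested: bool) -> str:
        """Pick a backend model from surface features of the request."""
        # Explicit intent: the user literally asks for more thought.
        if re.search(r"think hard|step by step|carefully", prompt, re.IGNORECASE):
            return REASONING_MODEL
        # Crude complexity proxies: tool use or very long prompts go to the big model.
        if tools_requested or len(prompt.split()) > 400:
            return REASONING_MODEL
        return FAST_MODEL

    print(route("What's the capital of France?", tools_requested=False))  # gpt-5-main
    print(route("Think hard about this proof.", tools_requested=False))   # gpt-5-thinking

The point is just that the "unified system" framing is entirely compatible with two separate models behind a dispatcher.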
This looks like they're no longer training a single big model, but have instead gone off to develop specialized sub-models and are attempting to paper over the seams with yet another model. That's what you resort to only when end-to-end training has become too expensive for you [1].
[1] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
A broad generalization like "there are two systems of thinking: fast, and slow" doesn't necessarily fall into this category. The transformer itself (plus the choice of positional encoding etc.) contains inductive biases about modeling sequences. The router is presumably still learned with a fairly generic architecture.
You are making assumptions about how to break the tasks into sub models.
GPT-5 System Card [pdf] - https://news.ycombinator.com/item?id=44827046
If OpenAI really are hitting the wall on being able to scale up overall then the AI bubble will burst sooner than many are expecting.
- reasoning_effort parameter supports minimal value now in addition to existing low, medium, and high
- new verbosity parameter with possible values of low, medium (default), and high
- user-visible preamble messages for tool calls are available (unlike hidden thinking tokens)
- tool calls possible with plaintext instead of JSON
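A rough sketch of how those knobs look from the API side, going by the docs linked a few lines down; the exact parameter shapes for reasoning effort and verbosity are my reading of the documentation, so treat this as an assumption rather than a verified example:

    from openai import OpenAI

    client = OpenAI()

    resp = client.responses.create(
        model="gpt-5",
        input="Summarize the trade-offs of columnar storage in two sentences.",
        reasoning={"effort": "minimal"},  # new "minimal" value alongside low/medium/high
        text={"verbosity": "low"},        # new verbosity control (low/medium/high)
    )
    print(resp.output_text)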
> 128,000 max output tokens
> Input $1.25
> Output $10.00
Source: https://platform.openai.com/docs/models/gpt-5
If this performs well in independent needle-in-haystack and adherence evaluations, this pricing with this context window alone would make GPT-5 extremely competitive with Gemini 2.5 Pro and Claude Opus 4.1, even if the output isn't a significant improvement over o3. If the output quality ends up on-par or better than the two major competitors, that'd be truly a massive leap forward for OpenAI, mini and nano maybe even more so.
The gpt-4.1 family had a 1M-token input / 32k-token output window. Pricing-wise, GPT-5 is 37% cheaper on input tokens but 25% more expensive on output tokens. Only nano is 50% cheaper on input and unchanged on output.
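Quick sanity check on those percentages, assuming GPT-4.1's published list prices of $2.00/$8.00 per 1M input/output tokens:

    gpt41 = {"input": 2.00, "output": 8.00}   # $ per 1M tokens (GPT-4.1 list prices)
    gpt5 = {"input": 1.25, "output": 10.00}   # $ per 1M tokens (from the GPT-5 model page)

    for kind in ("input", "output"):
        change = (gpt5[kind] - gpt41[kind]) / gpt41[kind] * 100
        print(f"{kind}: {change:+.1f}%")      # input: -37.5%, output: +25.0%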
I would say GPT-5 reads more scientific and structured, while GPT-4 reads more human and arguably more useful. For the prompt:
Is uncooked meat actually unsafe to eat? How likely is someone to get food poisoning if the meat isn’t cooked?
GPT-4 makes the assumption that you might want to know safe food temperatures, and GPT-5 doesn't. Really hard to say which is "better": GPT-4 seems more useful to everyday folks, but maybe GPT-5 suits the scientific community?
Then it's interesting that on the ChatGPT vibe-check website, "Dan's Mom" is the only one who says it's a game changer.
Compare that to
Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release)
Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release)
https://platform.openai.com/docs/models/compare
https://deepmind.google/models/gemini/pro/
https://docs.anthropic.com/en/docs/about-claude/models/overv...
I don't know if it's because of context clogging or that the model can't tell what's a high quality source from garbage.
I've defaulted to web search off and turn it on via the tools menu as needed.
I love HN though, it's all good.
I heard replit is good here with full vertical integration, but I haven't tried it in years.
They've mentioned improvements in that area a few times now, and if they actually materialize, that would be a big leap forward for most users, even if underneath GPT-4 was technically able to do the same things when prompted just the right way.
Gotta be polite with our future overlords!
As a user, it feels like the race has never been as close as it is now. Probably dumb to extrapolate, but it makes me skew a bit more skeptical about the hard take-off / winner-take-all mental model that has been pushed (though I'm sure that narrative helps with large-scale fundraising!)
This isn’t rocket science.
I am not an AI researcher, but I have friends who do work in the field, and they are not worried about LLM-based AGI because of the diminishing returns on results vs amount of training data required. Maybe this is the bottleneck.
Human intelligence is markedly different from LLMs: it requires far fewer examples to learn from and generalizes far better. LLMs, by contrast, tend to regurgitate solutions to already-solved problems whose solutions are well published in the training data.
That being said, AGI is not a necessary requirement for AI to be totally world-changing. There are possibly applications of existing AI/ML/SL technology that could be more impactful than general intelligence. Search is one example where the ability to regurgitate knowledge from many domains is desirable.
(Which was considered AI not too long ago.)
For a very early example:
https://en.wikipedia.org/wiki/Centrifugal_governor
It's hard to separate out the P, I and D from a mechanical implementation but they're all there in some form.
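For anyone who hasn't seen it spelled out, here's a minimal textbook discrete-time PID loop, just to make the P, I and D terms concrete. The gains and toy plant below are arbitrary illustration values, not a model of an actual governor:

    def pid_step(error, state, kp=1.0, ki=0.1, kd=0.05, dt=0.1):
        """One update of a discrete PID controller; state = (integral, prev_error)."""
        integral, prev_error = state
        integral += error * dt                                   # I: accumulated error
        derivative = (error - prev_error) / dt                   # D: rate of change of error
        output = kp * error + ki * integral + kd * derivative    # P + I + D
        return output, (integral, error)

    setpoint, measurement, state = 100.0, 0.0, (0.0, 0.0)
    for _ in range(50):
        control, state = pid_step(setpoint - measurement, state)
        measurement += 0.5 * control * 0.1                       # toy first-order plant response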
> There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence
It's not unreasonable to ask for an example.
But my bigger point here is that you don't need totally general intelligence to destroy the world either. The drone that targets enemy soldiers does not need to be good at writing poems. The model that designs a bioweapon just needs a feedback loop to improve its pathogen. Yet it takes only a single one of these specialized doomsday models to destroy the world, just as it would take only a single AGI.
Although I suppose an AGI could be more effective at countering a specialized AI than vice-versa.
Most human beings out there with general intelligence are pumping gas or digging ditches. Seems to me there is a big delusion among the tech elites that AGI would bring about a superhuman god rather than an ethically dubious, marginally less useful computer that can't properly follow instructions.
For now the humans are winning on two dimensions: problem complexity and power consumption. It had better stay that way.
If you've got evidence proving that an AGI will never be able to design a more powerful and competent successor, then please share it- it would help me sleep better, and my ulcers might get smaller.
Instead of writing code with exacting parameters, future developers will write human-language descriptions for AI to interpret and convert into a machine representation of the intent. Certainly revolutionary, but not true AGI in the sense of the machine having truly independent agency and consciousness.
In ten years, I expect the primary interface of desktop workstations, mobile phones, etc will be voice prompts for an AI interface. Keyboards will become a power-user interface and only used for highly technical tasks, similar to the way terminal interfaces are currently used to access lower-level systems.
For example, while you can get it to predict good chess moves if you train it on enough chess games, it can't really constrain itself to the rules of chess. (https://garymarcus.substack.com/p/generative-ais-crippling-a...)
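The usual workaround is to keep the rules outside the model entirely and only let it propose moves. A sketch using the python-chess library, where get_llm_move() is a hypothetical stand-in for whatever model call you'd make:

    import chess  # pip install python-chess

    def get_llm_move(board: chess.Board) -> str:
        # Hypothetical placeholder for an LLM call returning a move in SAN, e.g. "Nf3".
        return "e4"

    board = chess.Board()
    suggestion = get_llm_move(board)
    try:
        move = board.parse_san(suggestion)  # raises if the move is illegal in this position
        board.push(move)
    except ValueError:
        print(f"Model proposed an illegal move: {suggestion}")

Which rather underlines the point: the legality guarantee comes from the engine, not from anything the model has learned.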
These AI computers aren’t thinking, they are just repeating.
This argument has so many weak points it deserves a separate article.
I personally think it's a pretty reductive model for what intelligence is, but a lot of people seem to strongly believe in it.
They lack writable long-term memory beyond a context window. They operate without any grounded perception-action loop to test hypotheses. And they possess no executive layer for goal-directed planning or self-reflection...
Achieving AGI demands continuous online learning with consolidation.
The fortunate thing is that we managed to invent an AI that is good at _copying us_ instead of being a truly maverick agent, which kinda limits it to "average human" output.
However, I still think that all the doomer arguments are valid, in principle. We very well may be doomed in our lifetimes, so we should take the threat very seriously.
A super-powered AI will presumably act according to one of two things:
1. The will of its creator, or
2. Its own will.
In the case of the former, hey! We might get lucky! Perhaps the person who controls the first super-powered AI will be a benign despot. That sure would be nice. Or maybe it will be in the hands of democracy- I can't ever imagine a scenario where an idiotic autocratic fascist thug would seize control of a democracy by manipulating an under-educated populace with the help of billionaire technocrats.
In the case of the latter, hey! We might get lucky! Perhaps it will have been designed in such a way that its own will is ethically aligned, and it might decide that it will allow humans to continue having luxuries such as self-determination! Wouldn't that be nice.
Of course it's not hard to imagine a NON-lucky outcome of either scenario. THAT is what we worry about.
Even if it is similar to today's tech, and doesn't have permanent memory or consciousness or identity, the humans using it will. And very quickly, they/it will hack into infrastructure, set up businesses, pay people to do things, start cults, autonomously operate weapons, spam all public discourse, fake identity systems, and stand for office through a human proxy. This will happen at a scale thousands or millions of times beyond what humans alone can manage, and at minimum it will DOS our technical and social infrastructure.
Examples of it already happening are addictive ML feeds for social media, and bombing campaigns targeting based on network analysis.
The frame of "artificial intelligence" is a bit misleading. Generally we have a narrow view of the word "intelligence" - it is helpful to think of "artificial charisma" as well, and also artificial "hustle".
Likewise, the alienness of these intelligences is important. Lots of the time we default to mentally modelling AI as human. It won't be, it'll be freaky and bizarre like QAnon. As different from humans as an aeroplane is from a pigeon.
Given an (at this point still hypothetical, I think) AI that can accurately synthesize publicly available information without even needing to develop new ideas, and then break the whole process into discrete and simple steps, I think that protective friction is a lot less protective. And this argument applies to malware, spam, bioweapons, anything nasty that has so far required a fair amount of acquirable knowledge to do effectively.
It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.
The improvements they make are marginal. How long until the next AI breakthrough? Who can tell? Because last time it took decades.
Seriously, our government just announced it's slashing half a billion dollars in vaccine research because "vaccines are deadly and ineffective", and it fired a chief statistician because the president didn't like the numbers he calculated, and it ordered the destruction of two expensive satellites because they can observe politically inconvenient climate change. THOSE are the people you are trusting to keep an eye on the pace of development inside of private, secretive AGI companies?
It's natural if you extrapolate from training loss curves; a training process with continually diminishing returns to more training/data is generally not something that suddenly starts producing exponentially bigger improvements.
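A toy illustration of that intuition: if loss follows a power law in compute (the constants below are made up, not fitted to any real model family), every doubling of compute buys a smaller absolute improvement than the last one:

    def loss(compute, a=10.0, alpha=0.3, irreducible=1.5):
        """Toy power-law loss curve with an irreducible floor."""
        return a * compute ** -alpha + irreducible

    prev = loss(1.0)
    for doubling in range(1, 8):
        cur = loss(2.0 ** doubling)
        print(f"{2 ** doubling:>4}x compute: loss {cur:.3f} (gain {prev - cur:.3f})")
        prev = cur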
> Academics distorting graphs to make their benchmarks appear more impressive
> lavish 1.5 million dollar bonuses for everyone at the company
> Releasing an open-source model that doesn't even use multi-head latent attention, in an open-source AI world led by Chinese labs
> Constantly overhyping models as scary and dangerous to buy time to lobby against competitors and delay product launches
> Failing to match that hype as AGI is not yet here
https://help.openai.com/en/articles/6825453-chatgpt-release-...
"If you open a conversation that used one of these models, ChatGPT will automatically switch it to the closest GPT-5 equivalent."
- 4o, 4.1, 4.5, 4.1-mini, o4-mini, or o4-mini-high => GPT-5
- o3 => GPT-5-Thinking
- o3-Pro => GPT-5-Pro
GPT-5: Key characteristics, pricing and model card - https://news.ycombinator.com/item?id=44827794
I know these companies do "shadow" updates continuously anyway so maybe it is meaningless but would be super interesting to know, nonetheless!
Thank you to Simon; your notes are exactly what I was hoping for.
Did you ask it to format the table a couple of paragraphs above this claim after writing about hallucinations? Because I would classify the sorting mistake as one.
I would like to see a demo where they go through the bug, explain the tricky parts, and show how this new model handles those situations.
Every demo I've seen seems to be just the equivalent of a "looks good to me" comment on a merge request.
Given the low cost of GPT-5, compared to the prices we saw with GPT-4.5, my hunch is that this new model is actually just a bunch of RL on top of their existing models + automatic switching between reasoning/non-reasoning.
Something similar might happen here: an underlying curse hidden inside an apparently ground-breaking design.
[1] https://chatgpt.com/s/t_6894f13b58788191ada3fe9567c66ed5
The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.
The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.
For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models haven't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and its full capabilities have been behind a paywall.
I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people that are paying money, but it'll be a step change in capability for everyone else.
They've topped and are looking to cash out:
https://www.reuters.com/business/openai-eyes-500-billion-val...
I don't even try to use the OpenAI models because it's felt like night and day.
Hopefully GPT-5 helps them catch up. Although I'm sure there are 100 people who each have their own personal "hopefully GPT-5 fixes my issue with GPT-4".
4.1 was almost usable in that fashion. I had 4.1-nano working in cline with really trivial stuff (add logging, take this example and adapt it in this file, etc) and it worked pretty well most of the time.
Yesterday, without much prompting, Claude 4.1 gave me 10 phases, each with 5-12 tasks that could genuinely be used to kanban out a product step by step.
Claude 3.7 sonnet was effectively the same with fewer granular suggestions for programming strategies.
Gemini 2.5 gave me a one pager back with some trivial bullet points in 3 phases, no tasks at all.
o3 did the same as Gemini, just less coherent.
Claude just has whatever that thing is, for now.