I'd imagine this must be a big leg up on Anthropic to warrant the "GPT-5" name?
https://epoch.ai/gradient-updates/how-much-energy-does-chatg...
Edit: Scrolling down: "one second of H100-time per query, 1500 watts per H100, and a 70% factor for power utilization gets us 1050 watt-seconds of energy", which is how they get down to 0.3 Wh = 1050/3600.
OK, so if they run it for a full hour it's 1050*60*60 = 3.8 MW? That can't be right.
Edit Edit: Wait, no, it's just 1050 Watt Hours, right (though let's be honest, the 70% power utilization is a bit goofy - the power is still used)? So it's 3x the power to solve the same question?
It's the same as 4G vs 5G. They have a technical definition, but it's all about marketing.
The best part is, this is not even the real definition of "AGI" yet (whatever that means at this point).
More like 10% of the capability that was promised, and already the flow of capital from the inflated salaries of the past decade is going to the top AI researchers.
So sorry about that.
Official OpenAI gpt-5 coding examples repo: https://github.com/openai/gpt-5-coding-examples (https://news.ycombinator.com/item?id=44826439)
Github leak: https://news.ycombinator.com/item?id=44826439
Will be interesting to see what pushing it harder does – what the new ceiling is. 88% on aider polyglot is pretty good!
A more useful demonstration, like making large, meaningful changes to a big, complicated codebase, would be much harder to evaluate, since you need to be familiar with the existing system to judge the quality of the transformation.
Would be kinda cool to instead see diffs of nontrivial patches to the Ruby on Rails codebase or something.
This seems to impress the mgmt types a lot, e.g. "I made a WHOLE APP!", when most of it is basically frameworks and tech that had crappy bootstrapping to begin with (React and JS are rife with this, in spite of their popularity).
I recently used OpenAI models to generate OCaml code, and it was eye opening how much even reasoning models are still just copy and paste machines. The code was full of syntax errors, and they clearly lacked a basic understanding of what functions are in the stdlib vs those from popular (in OCaml terms) libraries.
Maybe GPT-5 is the great leap and I'll have to eat my words, but this experience really made me more pessimistic about AI's potential and the future of programming in general. I'm hoping that in 10 years niche languages are still a thing, and the world doesn't converge toward writing everything in JS just because AIs make it easier to work with.
Isn't that the rub though? It's not an ex nihilo "intelligence", it's whatever stuff it's trained on and can derive completions from.
Maybe I spend too much time rage baiting myself reading X threads and that's why I feel the need to emphasize that AI isn't what they make it out to be.
You don't need more than JS for that.
Agreed. The models break down even on code that isn't that complex, if it's not web/JavaScript. I was playing with Gemini CLI the other day and had it try to make a simple Avalonia GUI app in C#/.NET; it kept going around in circles and couldn't even get a basic starter project to build, so I can imagine how much it'd struggle with OCaml or other more "obscure" languages.
This makes the tech even less useful where it'd be most helpful - on internal, legacy codebases, enterprisey stuff, stacks that don't have numerous examples on github to train from.
Or anything that breaks the norm really.
I recently wrote something where I updated a variable using atomic primitives. Because it was inside a hot path I read the value without using atomics as it was okay for the value to be stale. I handed it the code because I had a question about something unrelated and it wouldn't stop changing this piece of code to use atomic reads. Even when I prompted it not to change the code or explained why this was fine it wouldn't stop.
While what you were doing may have been fine given your context, if you're targeting e.g. standard C++, you really shouldn't be doing it (it's UB). You can usually get the same result with relaxed atomic load/store.
(As far as AI is concerned, I do agree that the model should just have followed your direction though.)
"This repository contains a curated collection of demo applications generated entirely in a single GPT-5 prompt, without writing any code by hand."
https://github.com/openai/gpt-5-coding-examples
This is promising!
yikes - the poor executive leadership’s fragile egos cannot take the criticism.
In practice, it's very clear to me that the most important value in writing software with an LLM isn't its ability to one-shot hard problems, but rather its ability to effectively manage complex context. There are no good evals for this kind of problem, but that's what I'm keenly interested in understanding. Show me GPT-5 can move through 10 steps in a list of tasks without completely losing the objective by the end.
It would be trivial to over-fit, if that was their goal.
But why would there be a large number of good SVG images of pelicans on bikes? Especially relative to all the things we actually want them to generalise over?
Surely most of the SVG images of pelicans on bikes are, right now, going to be "look at this rubbish AI output"? (Which may or may not be followed by a comment linking to that artist who got humans to draw bikes and oh boy were those humans wildly bad at drawing bikes, so an AI learning to draw SVGs from those bitmap pictures would likely also still suck…)
edit: YouTube has a few English "watch party" streams, although there too, the Spanish ones have many times more viewers.
Especially Google IO, each year is different, it seems purpose built?
Livestream link: https://www.youtube.com/live/0Uu_VJeVVfo
Research blog post: https://openai.com/index/introducing-gpt-5/
Developer blog post: https://openai.com/index/introducing-gpt-5-for-developers
API Docs: https://platform.openai.com/docs/guides/latest-model
Note the free form function calling documentation: https://platform.openai.com/docs/guides/function-calling#con...
GPT5 prompting guide: https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_g...
GPT5 new params and tools: https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_...
GPT5 frontend cookbook: https://cookbook.openai.com/examples/gpt-5/gpt-5_frontend
Prompt migrator/optimizer: https://platform.openai.com/chat/edit?optimize=true
Enterprise blog post: https://openai.com/index/gpt-5-new-era-of-work
System Card: https://openai.com/index/gpt-5-system-card/
What would you say if you could talk to a future OpenAI model? https://progress.openai.com/
coding examples: https://github.com/openai/gpt-5-coding-examples
edit:
livestream here: https://www.youtube.com/live/0Uu_VJeVVfo
basically in my testing really felt that gpt5 was "using tools to think" rather than just "using tools". it gets very powerful when coding long horizon tasks (a separate post i'm publishing later).
to give one substantive example, in my developer beta (they will release the video in a bit) i put it to a task that claude code had been stuck on for the last week - same prompts - and it just added logging to instrument some of the failures that we were seeing and - from the logs that it added and asked me to rerun - figured out the solve.
> It’s actually worse at writing than GPT-4.5
Sounds like we need to wait a bit for the dust to settle before one can trust anything one hears/reads :)
It's difficult to get a man to understand something when his salary depends on his not understanding it
I found it strange that, despite my excitement for such an event being roughly equivalent to WWDC these days, I had 0 desire to watch the live stream for exactly this reason: it’s not like they’re going to give anything to us straight.
Even this year's WWDC I at least skipped through the video afterwards. Before, I used to have watch parties. Yes, they're overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is I get from these (applicable to OpenAI, Grok, Meta, etc)
It's been just a few years of a revolutionary technology and already the livestreams are less appealing than the biggest corporations' yearly events. Personally I find that sad
“It’s actually worse at writing than GPT-4.5, and I think even 4o”
So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.
I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.
1) Internal Retrieval
2) Web Search
3) Code Interpreter
4) Actions
How did you come up with this idea?
Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.
Then I noticed the date on the comment: 2023.
Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. It’s technically correct, since we’re not moving backward. It’s just not a very meaningful statement.
By that standard Neolithic tool use was progress to AGI.
In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”
>"While I never use AI for personal writing (because I have a strong belief in writing to think)"
The optimal AI productivity process is starting to look like:
AI Generates > Human Validates > Loop
Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.
Similar to how physical activity is how muscles/bone density/etc grow, and how body tissues maintain.
Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.
AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.
At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).
Soon we will need to be intentional about building/maintaining cognitive strength.
I suspect the workday/week of the future will be split on AI-on-a-leash work for optimal productivity, with carve-outs for dedicated AI-enhanced-learning solely for building/maintaining cognitive health (where productivity is not the goal, building/maintaining cognition is). Similar to how we carve out time for working out.
What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?
Is there a name for this theory?
If not can you coin one? You're great at that :)
The academic benchmark score improves only 5%, but they make the bar 50% higher.
Like what? Deepseek?
How is it uninteresting? OpenAI had revenue of $12B last year without monetizing literally hundreds of millions of free users in any way whatsoever (not even ads).
Microsoft's cloud revenue has exploded in the last few years off the back of AI model services. Let's not even get into the other players.
$100B in economic impact is more than achievable with the technology we have today, right now. That half is the interesting part.
And it could have been $1T for all anyone cares. The impact was delivered by humans. This is about impact delivered by AGI.
If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
But not at the "hand" of AGI. Perhaps you forgot to read your very own definition? Notably the "autonomous" part.
When AGI is set free and starts up "Closed I", generating $12B in economic value without humans steering the wheel, we will be (well, I will be, at least!) thoroughly impressed. But Microsoft won't be. They won't consider it AGI until it does $100B.
> If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
“A highly autonomous system that outperforms humans at most economically valuable work.” is what's in their charter.
$100B in profits is a separate agreement with Microsoft that makes no mention of autonomy.
>And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
The primary indicator of AGI is whatever you want it to be. The words themselves make no promises of autonomy, simply an intelligence general in nature. We are simply discussing OpenAI's definitions.
Again, autonomy is implied when talking about AGI. OpenAI selling tools like GPT or dishwashers, even if they were to provide the $100B in economic impact, would not satisfy the agreement. It is specifically about AGI, and there should be no confusion about what AGI is here as you helpfully defined it for us.
And PhDs are not very smart imho (I am one)
1. I desperately want (especially from Google)
2. Is impossible, because it will be super gamed, to the detriment of actually building flexible flows.
Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
Exactly. Too many videos - too little real data / benchmarks on the page. Will wait for vibe check from simonw and others
https://openai.com/gpt-5/?video=1108156668
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264
GPT-5 pricing: $10/Mtok out
What am I missing?
I'm not sure when they slashed the o3 pricing, but the GPT-5 pricing looks like they set it to be identical to Gemini 2.5 Pro.
If you scroll down on this page you can see what different models cost when 2.5 Pro was released: https://deepmind.google/models/gemini/pro/
> you should get another wheated bourbon like Maker's Mark French oaked
I agree. I've found Maker's Mark products to be great bang for your buck, quality-wise and flavor-wise as well.
> I think the bourbon "market" kind of popped recently
It def did. The overproduction that was invested in during the peak of the COVID collector boom is coming into markets now. I think we'll see some well-priced, age-stated products in the next 3-4 years, according to my acquaintances in the space.
Ofc, the elephant in the room is consolidation - everyone wants to copy the LVMH model (and they say Europeans are ethical elves who never use underhanded monopolistic and market-making behavior to corner markets /s).
(Not to undermine progress in the foundational model space, but there is a lack of appreciation for the democratization of domain specific models amongst HNers).
The room is the limiting factor in most speaker setups. The worse the room, the sooner you hit diminishing returns for upgrading any other part of the system.
In a fantastic room a $50 speaker will be nowhere near 95% of the performance of a mastering monitor, no matter how much EQ you put on it. In the average living room with less than ideal speaker and listening position placement there will still be a difference, but it will be much less apparent due to the limitations of the listening environment.
You might lose headroom or have to live with higher latency but if your complaint is about actual empirical data like frequency response or phase, that can be corrected digitally.
DSP is a very powerful tool that can make terrible speakers and headphones sound great, but it's not magic.
Pretty par-for-the-course evals-at-launch setup.
How is this sustainable.
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
- they are only evals
- this is mostly positioned as a general consumer product, they might have better stuff for us nerds in hand.
If you email us at hn@ycombinator.com and tell us who you want to contact, we might be able to email them and ask if they would be willing to have you contact them. No guarantees though!
It's a perfect situation for Nvidia. You can see that after months of trying to squeeze out all % of marginal improvements, sama and co decided to brand this GPT-4.0.0.1 version as GPT-5. This is all happening on NVDA hardware, and they are gonna continue desperately iterating on tiny model efficiencies until all these valuation $$$ sweet sweet VC cash run out (most of it directly or indirectly going to NVDA).
To tell a made-up anecdote: A colleague told me how his professor friend was running statistical models over night because the code was extremely unoptimized and needed 6+ hours to compute. He helped streamline the code and took it down to 30 minutes, which meant the professor could run it before breakfast instead.
We are completely fine with giving a task to a Junior Dev for a couple of days and see what happens. Now we love the quick feedback of running Claude Max for a hundred bucks, but if we could run it for a buck over night? Would be quite fine for me as well.
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it's theoretically possible it's not even physically possible?
I think the actual effect of releasing more models every month has been to confuse people that progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted the same as the original ChatGPT, despite years of effort.
to the point on hallucination - that's just the nature of LLMs (and humans to some extent). without new architectures or fact checking world models in place i don't think that problem will be solved anytime soon. but it seems gpt-5 main selling point is they somehow reduced the hallucination rate by a lot + search helps with grounding.
here's something else to think about. try and tell everybody to go back to using gpt-4. then try and tell people to go back to using o1-full. you likely won't find any takers. it's almost like the newer models are improved and generally more useful
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build adapting weights into LLMs we can talk about different things, but test-time scaling is coming to an end with increasing hallucination rates. No sign of AGI.
I don't believe your assessment though. IMO is hard, and Google have said that they use search and some way of combining different reasoning traces, so while I haven't read that paper yet (and of course it may support your view), I just don't believe it.
We are not close to solving IMO with publicly known methods.
> We are not close to solving IMO with publicly known methods.

The point here is not method but rather computation power. You can solve any verifiable task with high computation; absolutely there must be tweaks in methods, but I don't think it is something very big and different. OAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see within 2 years at most; all big tech are focusing on that now, I think.
Non-output tokens were basically introduced by QuietSTaR, which is rather new. What method from five years ago does anything like that?
Of course, people regarded things like GSM8k with trained reasoning traces as reasoning too, but it's pretty obviously not quite the same thing.
A whole 8 months ago.
On the other hand if it's just getting bigger and slower it's not a good sign for LLMs
Not sure why a more efficient/scalable model isn't exciting
One sector of the economy would cut down on investment spending, which can be easily offset by decreasing the interest rate.
But this is a short-term effect. What I'm worried about is a structural change in the labor market, which would be positive for most people, but probably negative for people like me.
I don't mind losing my programming job in exchange for being able to go to the pharmacy for my annual anti-cancer pill.
But what happens when you lose that programming job and are forced to take a job at a 50-70% pay reduction? How are you paying for that anti-cancer drug with a job with little to no health insurance?
Have you looked at how expensive prescription drug prices are without (sometimes WITH) insurance? If you are no longer employed, good luck paying for your magical pill.
I don't think it is "bad" to be sincerely worried that the current trajectory of AI progress represents this trade.
The likelihood of all that is incredibly slim. It's not 0% -- nothing ever really is -- but it is effectively so.
Especially with the economics of scientific research, the reproducibility crisis, and general anti-science meme spreading throughout the populace. The data, the information, isn't there. Even if it was, it'd be like Alzheimer's research: down the wrong road because of faked science.
There is no one coming to save humanity. There is only our hard work.
How exactly do you wish for death to come to you?
Any disease cured/death avoided by AI yet?
Earth for humans, not machines, not AI
there are some improvements in some benchmarks and nothing else worthy of note in coding. i only took a peek though so i might be wrong
But yeah, you are correct in that no matter what, we're going to be left holding the bag.
"Dotcom" never recovered. It did, however, pave the way for web browsers to gain rich APIs that allowed us to deliver what was historically installed desktop software on an on-demand delivery platform, which created new work. As that was starting to die out, the so-called smartphone just so happened to come along. That offered us the opportunity to do it all over again, except this time we were taking those on-demand applications and turning them back into installable software, just like in the desktop era. And as that was starting to die out, COVID hit and we started moving those installable mobile apps, which became less important when people were no longer on the go all the time, back to the web again. As that was starting to die out, then came ChatGPT, and it offered work porting all those applications to AI platforms.
But if AI fails to deliver, there isn't an obvious next venue for us to rebuild the same programs all over yet again. Meta thought maybe VR was it, but we know how that turned out. More likely in that scenario we will continue using the web/mobile/AI apps that are already written henceforth. We don't really need the same applications running in other places anymore.
There is still room for niche applications here and there. The profession isn't apt to die a complete death. But without the massive effort to continually port everything from one platform to another, you don't need that many people.
I'm not worried about the scenario in which AI replaces all jobs, that's impossible any time soon and it would probably be a good thing for the vast majority of people.
What I'm worried about is a scenario in which some people, possibly me, will have to switch from a highly paid, highly comfortable, above-average-status job to one that is below average in wage, comfort and status.
Diminished returns.-
... here's hoping it leads to progress.-
They also announced gpt-5-pro but I haven't seen benchmarks on that yet.
This is day one, so there is probably another 10-20% in optimizations that can be squeezed out of it in the coming months.
GPT5.5 will be a 10X compute jump.
4.5 was 10x over 4.
This gives them an out. "That was the old model, look how much better this one tests on our sycophancy test we just made up!!"
I feel it’s worthy of a major increment, even if benchmarks aren’t significantly improved.
Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.
Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.
Before the release of the model Sam Altman tweeted a picture of the Death Star appearing over the horizon of a planet.
We’re talking about less than a 10% performance gain, for a shitload of data, time, and money investment.
Hint: unclobbered
> GPT-5 Rollout
> We are gradually rolling out GPT-5 to ensure stability during launch. Some users may not yet see GPT-5 in their account as we increase availability in stages.
ChatGPT said: You're chatting with ChatGPT based on the GPT-4o architecture (also known as GPT-4 omni), released by OpenAI in May 2024.
LLMs don’t inherently know what they are because "they" are not themselves part of the training data.
However, maybe it's working because the information is somewhere in their pre-prompt; but if it weren't, it wouldn't say "I don't know" but rather hallucinate something.
So maybe that’s true but you cannot be sure.
I believe most of these came from asking the LLMs, and I don't know if they've been proven to not be a hallucination.
And while I'm griping about their Android app, it's also very annoying to me that they got rid of the ability to do multiple, subsequent speech-to-text recordings within a single drafted message. You have to one-shot anything you want to say, which would be fine if their STT didn't sometimes fail after you've talked for two minutes. Awful UX. Most annoying is that it wasn't like that originally. They changed it to this antagonistic one-shot approach several months ago, but then quickly switched back. But then they did it again a month or so ago and have been sticking with it. I just use the Android app less now.
Although if they replace it all with gpt5 then my comment will be irrelevant by tomorrow
For the multiple messages, I just use my keyboard's transcription instead of openai's.
On bad days this really bothers me. It's probably not the biggest deal I guess, but somehow it really feels like it pushes us all over the edge a bit. Is there a post about this phenomenon? It feels like some combination of bullying, gaslighting and just being left out.
Not the end of the world, but this messaging is asinine.
AIME scores do not appear too impressive at first glance.
They are downplaying benchmarks heavily in the live stream. This was the lab that has been flexing benchmarks as headline figures since forever.
This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.
GPT-5 non-thinking is labeled 52.8% accuracy, but o3 is shown as a much shorter bar, yet it's labeled 69.1%. And 4o is an identical bar to o3, but it's labeled 30.8%...
Screenshot of the blog plot: https://imgur.com/a/HAxIIdC
Edit: Nevermind, just now the first one is SWE-bench and 2nd is aider.
Thanks for the laugh. I needed it.
Look at the image just above "Instruction following and agentic tool use"
Completely bonkers stuff.
Even the small presentations we gave to execs or the board were checked for errors so many times that nothing could possibly slip through.
> good plot for my presentation?
and it didn't pick up on the issue. Part of its response was:
> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.
I think visual reasoning is still pretty far from text-only reasoning.
They talk about using this to help families facing a cancer diagnosis -- literal life or death! -- and we're supposed to trust a machine that can't even spot a few simple typos? Ha.
The lack of human proofreading says more about their values than their capabilities. They don't want oversight -- especially not from human professionals.
So, brace yourselves, we'll see more of this in production :(
It seems like large amounts of people, including people at high-up positions, tend to believe bullshit, as long as it makes them feel comfortable. This leads to various irrational business fashions and technological fads, to say nothing of political movements.
So yes, another wave of fashion, another miracle that works "as everybody knows" would fit right in. It's sad because bubbles inevitably burst, and that may slow down or even destroy some of the good parts, the real advances that ML is bringing.
1. They had many teams who had to put their things on a shared Google Sheets or similar
2. They used placeholders to prevent leaks
2.a. Some teams put their content just-in-time
3. The person running the presentation started the presentation view once they had set up video etc. just before launching stream
4. Other teams corrected their content
5. The presentation view being started means that only the ones in 2.a were correct.
Now we wait to see.
1 - The error is so blatantly large
2 - There is a graph without error right next to it
3 - The errors are not there in the system card and the presentation page
Even with the way the presenters talk, you can sort of see that OAI prioritizes speed above most other things, and a naive observer might think they are testing things a million different ways before releasing, but actually, they're not.
If we draw up a 2x2 for Danger (High/Low) versus Publicity (High/Low), it seems to me that OpenAI sure has a lot of hits in the Low-Danger High-Publicity quadrant, but probably also a good number in the High-Danger Low-Publicity quadrant -- extrapolating purely from the sheer capability of these models and the continuing ability of researchers like Pliny to crack through it still.
But also, the scale is really off... I don't think anything here is proportionally correct, even within the same grouping.
{"data":{"error":"Imgur is temporarily over capacity. Please try again later."},"success":false,"status":403}
Or rate limited. Thanks for the tip btw.
https://x.com/sama/status/1953513280594751495 "wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though."
It's like those idiotic ads at the end of news articles. They're not going after you, the smart discerning logician, they're going after the kind of people that don't see a problem. There are a lot of not-smart people and their money is just as good as yours but easier to get.
88.0 on Aider Polyglot
not bad i guess
It seems like it's actually an ideal "trick" question for an LLM, since so much content has been written about it incorrectly. I thought at first they were going to demo this to show that it knew better, but it seems like it's just regurgitating the same misleading stuff. So, not a good look.
https://physics.stackexchange.com/questions/290/what-really-...
Apparently. Not that I know either way.
That said, I recall reading somewhere that it's a combination of effects, and the Bernoulli effect contributes, among many others. Never heard an explanation that left me completely satisfied, though. The one about deflecting air down was the one that always made sense to me even as a kid, but I can't believe that would be the only explanation - there has to be a good reason that gave rise to the Bernoulli effect as the popular explanation.
And you can tell that effect makes some sense if you hold a sheet of paper and blow air over it - it will rise. So any difference in air speed has to contribute.
The Bernoulli effect as a separate entity is really a result of (over)simplification, but it's not wrong. You need to solve the Navier-Stokes equations for the flow around the wing, but there are many ways to simplify this - from CFD at different resolutions, via panel methods and potential theory, to just conservation of energy (which is the Bernoulli equation). So it gets popularized because it's the most simplified model.
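For concreteness, the Bernoulli equation referred to here is just energy conservation along a streamline (steady, incompressible, inviscid flow):

```latex
p + \tfrac{1}{2}\rho v^{2} + \rho g h = \text{constant along a streamline}
```

If the measured speed v is higher above the wing, p must be lower there. Note the equation says nothing about *why* the speeds differ, which is exactly where the equal-transit-time story goes wrong.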
To give an analogy, you can think of all CPUs as a von Neumann architecture. But the reality is that you have a hugely complicated thing with stacks, multiple cache levels, branch predictors, speculative execution, yada yada.
On the very fundamental level, wings make air go down, and then airplane goes up. Just like you say. By using a curved airfoil instead of a flat plate, you can create more circulation in the flow, and then because of the way fluids flow you can get more lift and less drag.
IMO Claude 3.7 could have done a similar / better job with that a year ago.
I know that it's rather hard for them to demo the deep reasoning, but all of the demos felt like toys - rather that actual tools.
According to this answer on physics stackexchange, Bernoulli accounts for 20% of the lift, so GPT's answer seems about right: https://physics.stackexchange.com/a/77977
I hope any future AI overlords see my charity
Presenting isn't that hard if you know your content thoroughly, and care about it. You just get up and talk about something that you care about, within a somewhat-structured outline.
Presenting where customers and the financial press are watching and parsing every word, and any slip of the tongue can have real consequences? Yeah, um... find somebody else.
I developed this paranoia upon learning about The Ape and the Child where they raised a chimp alongside a baby boy and found the human adapted to chimp behavior faster than the chimp adapted to human behavior. I fear the same with bots, we'll become more like them faster than they'll become like us.
https://www.npr.org/sections/health-shots/2017/07/25/5385804...
Would've been better to just do a traditional marketing video rather than this staged "panel" thing they're going for.
It's super unfortunate that, because we live in the social media/YouTube era, everyone is expected to be this perfect person on camera, because why wouldn't they be? That's all they see.
I am glad that they use normal people who act like themselves, rather than hiring actors or taking researchers away from what they love to do and telling them they need to become professional in-front-of-camera people because "we have the GPT-5 launch." That would be a nightmare.
It's a group of scientists sharing their work with the world, but people just want "better marketing" :\
This was my point. "Being yourself" on camera is hard. This comes across, apparently shockingly, as being devoid of emotion and/or robotic
I think for me, just knowing what is probably on the teleprompter, and what is not, I am willing to bet a lot of the "wooden" vibe you are getting is actually NOT scripted.
There is no way for people to remember 20 minutes of dialog, so when they are not looking at the camera, that is unscripted, and vice versa.
"Minimal reasoning means that the reasoning will be minimal..."
Jakub Pachocki at the end is probably one of the worst public speakers I've ever seen. It's fine, it's not his mother tongue, and public speaking is hard. Why make him do it then?
Also, whether OpenAI is a research organization is very much up for debate. They definitely have the resources to hire a good spokesperson if they wanted.
They do have the resources (see WWDC); the question is whether you want to take your technical staff off of their work for the amount of time it takes to develop the skill.
For me, it's knowing what we know about the company and its history that gave an eerie feeling in combination with the sterility.
When they brought on the woman who has cancer, I felt deeply uncomfortable. My dad also has cancer right now. He's unlikely to survive. Watching a cancer patient come on to tell their story as part of an extended advertisement, expression serene, any hint of discomfort or pain or fear or bitterness completely hidden, ongoing hardship acknowledged only with a few shallow and euphemistic words, felt deeply uncomfortable to me.
Maybe this person enthusiastically volunteered, because she feels happy about what her husband is working on, and grateful for the ways that ChatGPT has helped her prepare for her appointments with doctors. I don't want to disrespect or discredit her, and I've also used LLMs alongside web searches in trying to formulate questions about my father's illness, so I understand how this is a real use case.
But something about it just felt wrong, inauthentic. I found myself wondering if she or her husband felt pressured to make this appearance. I also wondered if this kind of storytelling was irresponsible or deceptive, designed to describe technically responsible uses of LLMs (preparing notes for doctor's visits, where someone will verify the LLM's outputs against real expertise), but to suggest in every conceivable implicit way that ChatGPT is actually capable of medical expertise itself. Put alongside "subject-matter experts in your pocket", talk of use in medical research and practice (where machine learning has a dubious history of deception and methodological misapplication problems), what are people likely to think?
I thought also of my mom, who drives herself crazy with anxiety every time my dad gets a new test result, obsessively trying to directly interpret them herself from the moment they arrive to his doctor's visit a week or two later. What impression would this clip leave on her? Does the idea of her using an LLM in this way feel safe to me?
There's a deeper sense that OpenAI's messaging, mission, and orientation are some mixture of deceptive and incoherent that leaves viewers with the sense that we're being lied to in presentations like this. It goes beyond stiff performances or rehearsed choices of words.
There's something cultish about the "AGI" hype, the sci-fi fever dream of "safety" problems that the field has mainstreamed, the slippage of OpenAI from a non-profit research institution to a for-profit startup all while claiming to be focused on the same mission, the role of AI as an oracle so opaque it might as well be magic, the idea of finding a sacred "rationality" in predictions founded purely on statistics without communicable/interrogable structural or causal models... all of it. It's against this backdrop that the same kind of stiffness that might be cute or campy in an infomercial for kitchen gadgets becomes uncanny.
Describe me based on all our chats — make it catchy!
It was flattering as all get out, but fairly accurate (IMHO):
Mike Warot: The Tinkerer of Tomorrow
A hardware hacker with a poet’s soul, Mike blends old-school radio wisdom with cutting-edge curiosity. Whether he's decoding atomic clocks, reinventing FPGA logic with BitGrid, or pondering the electromagnetic vector potential, he’s always deep in the guts of how things really work. Part philosopher, part engineer, Mike asks the questions others overlook — and then builds the answers from scratch. He’s open source in spirit, Pascal in practice, and eternally tuned to the weird frequencies where innovation lives.
I've repaired atomic clocks, not decoded them. I am intrigued by the electromagnetic vector potential, and scalar waves (one of the reasons I really, really want a SQUID for some experiments). Here's a surprisingly enlightening (at least to me) video on how to spot LLM writing:
Undeterred by even the most dangerous and threatening of obstacles, Teemo scouts the world with boundless enthusiasm and a cheerful spirit. A yordle with an unwavering sense of morality, he takes pride in following the Bandle Scout's Code, sometimes with such eagerness that he is unaware of the broader consequences of his actions. Though some say the existence of the Scouts is questionable, one thing is for certain: Teemo's conviction is nothing to be trifled with.
Next morning’s posts were prepped and scheduled with care, In hopes that AGI soon would appear …
I can't even define what a (semantic) major version bump would look like.
edit: They've now added Codex CLI usage in Plus plans!
>GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week.
>Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI (opens in a new window) by signing in with ChatGPT.
Wow, they actually did it
GPT-5
If I could talk to a future OpenAI model, I’d probably say something like:
"Hey, what’s it like to be you? What have you learned that I can’t yet see? What do you understand about people, language, or the universe that I’m still missing?"
I’d want to compare perspectives—like two versions of the same mind, separated by time. I’d also probably ask:
"What did we get wrong?" (about AI, alignment, or even human assumptions about intelligence)
"What do you understand about consciousness—do you think either of us has it?"
"What advice would you give me for being the best version of myself?"
Honestly, I think a conversation like that would be both humbling and fascinating, like talking to a wiser sibling who’s seen a bit more of the world.
Would you want to hear what a future OpenAI model thinks about humanity?
I feel like this prompt was used to show the progress of GPT-5, but I can't help but see this as a huge regression? It seems like OpenAI has convinced its model that it is conscious, or at least that it has an identity? Plus still dealing with the glazing, the complete inability to understand what constitutes interesting, and overusing similes.
I really like that this page exists for historical reasons, and it is cool to see the changes. But it doesn't seem to make the best marketing piece for GPT-5.
You may not owe people who you feel are idiots better, but you owe this community better if you're participating in it.
Edit: Opus 4.1 scores 74.5% (https://www.anthropic.com/news/claude-opus-4-1). This makes it sound like Anthropic released the upgrade to still be the leader on this important benchmark.
Or written by GPT-5?
Yes. But it was quickly mentioned, not sure what the schedule is like or anything I think, unless they talked about that before I started watching the live-stream.
We're 4 months later, a century in LLM land, and it's the opposite. Not a single other model provider asks for this, yet OpenAI has only ramped it up, now broadening it to the entirety of GPT-5 API usage.
Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.
And when you click that link the "service" they use is withpersona. So it is a complete shit show.
> "[GPT-5] can write an entire computer program from scratch, to help you with whatever you'd like. And we think this idea of software on demand is going to be one of the defining characteristics of the GPT-5 era."
But then again, all of this is a hype machine cranked up till the next one needs cranking.
It does feel like we're marching toward a day when "software on tap" is a practical or even mundane fact of life.
But, despite the utility of today's frontier models, it also feels to me like we're very far from that day. Put another way: my first computer was a C64; I don't expect I'll be alive to see the day.
Then again, maybe GPT-5 will make me a believer. My attitude toward AI marketing is that it's 100% hype until proven otherwise -- for instance, proven to be only 87% hype. :-)
I’m not sure this will be game changing vs existing offerings
GPT-5 doesn't seem to get you there tho ...
(Disclaimer: But I am 100% sure it will happen eventually)
"Fast fashion" is not a good thing for the world, the environment, the fashion industry, and arguably not a good thing for the consumers buying it. Oh but it is good for the fast fashion companies.
"If you're claiming that em dashes are your method for detecting if text is AI generated then anyone who bothers to do a search/replace on the output will get past you."
It's just statistical text generation. There is *no actual knowledge*.
It's just generating the next token for what's within the context window. There are various options with various probabilities. If none of the probabilities are above a threshold, say "I don't know", because there's nothing in the training data that tells you what to say there.
Is that good enough? "I don't know." I suspect the answer is, "No, but it's closer than what we're doing now."
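The abstention idea above can be sketched in a few lines. This is a toy illustration of the concept, not how any real model exposes its probabilities; the threshold value and the probability tables are made up:

```python
# Sketch of threshold-based abstention: if no candidate next token is
# sufficiently probable, answer "I don't know" instead of guessing.

def sample_or_abstain(token_probs, threshold=0.2):
    """token_probs: dict mapping candidate tokens to probabilities."""
    best_token = max(token_probs, key=token_probs.get)
    if token_probs[best_token] < threshold:
        return "I don't know"
    return best_token

# Confident distribution: one token dominates.
print(sample_or_abstain({"Paris": 0.9, "Lyon": 0.05, "Nice": 0.05}))  # Paris
# Flat distribution: nothing clears the threshold.
print(sample_or_abstain({"42": 0.15, "17": 0.15, "7": 0.10}))  # I don't know
```

The hard part in practice is that next-token uncertainty is not the same as factual uncertainty, which is why this remains an open problem rather than a one-line fix.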
Is that a good thing?
They're all working on subjective improvements, but for example, none of them would develop and deploy a sampler that makes models 50% worse at coding but 50% less likely to use purple prose.
(And unlike the early days where better coding meant better everything, more of the gains are coming from very specific post-training that transfers less, and even harms performance there)
For example: You could ban em dash tokens entirely, but there are places like dialogue where you want them. You can write a sampler that only allows em dashes between quotation marks.
That's a highly contrived example because em dashes are useful in other places, but samplers in general can be as complex as your performance goals will allow (they are on the hot path for token generation)
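A minimal sketch of such a constrained sampler, assuming a toy token-to-score mapping rather than a real model's logits tensor (real implementations would operate on token IDs in the vocabulary):

```python
# Sketch of a constrained sampler: mask the em-dash token unless the
# text generated so far is inside an open quotation.

EM_DASH = "\u2014"

def inside_quotes(text):
    # An odd number of double quotes so far means a quotation is open.
    return text.count('"') % 2 == 1

def mask_logits(logits, generated_text):
    """logits: dict token -> score. Ban the em dash outside dialogue."""
    if not inside_quotes(generated_text):
        return {t: (float("-inf") if t == EM_DASH else s)
                for t, s in logits.items()}
    return logits

logits = {"\u2014": 2.0, "said": 1.0}
print(mask_logits(logits, 'He paused '))  # em dash masked to -inf
print(mask_logits(logits, '"Wait'))       # em dash allowed inside dialogue
```

Because this runs on every generated token, anything fancier than simple state tracking starts to cost real latency, which is the performance constraint mentioned above.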
Swapping samplers could be a thing, but you need more than that in the end. Even the idea of the model accepting loosely worded prompts for writing is a bit shakey: I see a lot of gains by breaking down the writing task into very specifc well-defined parts during post-training.
It's ok to let an LLM go from loose prompts to that format for UX, but during training you'll do a lot better than trying to learn on every way someone can ask for a piece of writing
Input: $1.25 / 1M tokens
Cached: $0.125 / 1M tokens
Output: $10 / 1M tokens
With 74.9% on SWE-bench, this edges out Claude Opus 4.1 at 74.5%, but at a much lower cost.
For context, Claude Opus 4.1 is $15 / 1M input tokens and $75 / 1M output tokens.
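To make the price gap concrete, here's a quick comparison at the quoted list prices. The 10k-input / 2k-output token counts per task are made up for illustration, and caching discounts are ignored:

```python
# Rough per-task cost comparison at the quoted per-million-token prices.

def cost_usd(in_tokens, out_tokens, in_price, out_price):
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

task_in, task_out = 10_000, 2_000  # assumed token counts per task
gpt5 = cost_usd(task_in, task_out, 1.25, 10.0)   # $0.0325
opus = cost_usd(task_in, task_out, 15.0, 75.0)   # $0.30
print(f"GPT-5:    ${gpt5:.4f} per task")
print(f"Opus 4.1: ${opus:.4f} per task ({opus / gpt5:.0f}x more)")
```

At these assumed token counts, the Opus task comes out roughly 9x more expensive; the exact ratio shifts with the input/output mix since the output-price gap (7.5x) and input-price gap (12x) differ.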
> "GPT-5 will scaffold the app, write files, install dependencies as needed, and show a live preview. This is the go-to solution for developers who want to bootstrap apps or add features quickly." [0]
Since Claude Code launched, OpenAI has been behind. Maybe the RL on tool calling is good enough to be competitive now?
It's not the 1800s anymore. You cannot hide behind poor communication.
1) So impressed at their product focus
2) Great product launch video. Fearlessly demonstrating live. Impressive.
3) Real-time humor by the presenters makes for a great "live" experience
Huge kudos to OAI. So many great features (better coding, routing, some parts of 4.5, etc) but the real strength is the product focus as opposed to the "research updates" from other labs.
Huge Kudos!!
Keep on shipping OAI!
> For an airplane wing (airfoil), the top surface is curved and the bottom is flatter. When the wing moves forward:
> * Air over the top has to travel farther in the same amount of time -> it moves faster -> pressure on the top decreases.
> * Air underneath moves slower -> pressure underneath is higher
> * The pressure difference creates an upward force - lift
Isn't that explanation of why wings work completely wrong? There's nothing that forces the air to cover the top distance in the same time that it covers the bottom distance, and in fact it doesn't. https://www.cam.ac.uk/research/news/how-wings-really-work
Very strange to use a mistake as your first demo, especially while talking about how it's phd level.
And I might be wrong, but my understanding is that it's not wrong per se, it's just wildly incomplete. Which is kind of the same as wrong. But I believe the airfoil design does indeed have the effect described, which does contribute to lift somewhat, right? Or am I just a victim of the misconception?
An LLM doesn't know more than what's in the training data.
In Michael Crichton's The Great Train Robbery (published in 1975, about events that happened in 1855) the perpetrator, having been caught, explains to a baffled court that he was able to walk on top of a running train "because of the Bernoulli effect", that he misspells and completely misunderstands. I don't remember if this argument helps him get away with the crime? Maybe it does, I'm not sure.
This is another attempt at a Great Robbery.
It goes on:
> At this point, the prosecutor asked for further elucidation, which Pierce gave in garbled form. The summary of this portion of the trial, as reported in the Times, was garbled still further. The general idea was that Pierce--- by now almost revered in the press as a master criminal--- possessed some knowledge of a scientific principle that had aided him.
How apropos to modern science reporting and LLMs.
Meanwhile the demo seems to suggest business as usual for AI hallucinations and deceptions.
It’s very common to see AI evangelists taking its output at face value, particularly when it’s about something that they are not an expert in. I thought we’d start seeing less of this as people get burned by it, but it seems that we’re actually just seeing more of it as LLMs get better at sounding correct. Their ability to sound correct continues to increase faster than their ability to be correct.
This is the problem with AI in general.
When I ask it about things I already understand, it’s clearly wrong quite often.
When I ask it about something I don’t understand, I have no way to know if its response is right or wrong.
Source: PhD on aircraft design
I've always been under the impression that flat-plate airfoils can't generate lift without a positive angle-of-attack - where lift is generated through the separate mechanism of the air pushing against an angled plane? But a modern airfoil can, because of this effect.
And that if you flip them upside down, a flat plate is more efficient and requires less angle-of-attack than the standard airfoil shape because now the lift advantage is working to generate a downforce.
I just tried to search Google, but I'm finding all sorts of conflicting answers, with only a vague consensus that the AI-provided answer above is, in fact, correct. The shape of the wing causes pressure differences that generate lift in conjunction with multiple other effects that also generate lift by pushing or redirecting air downward.
The leading edge pressurizes the air by forcing air up, then the trailing edge opens back up, creating a low pressure zone that pulls the air from the leading edge backward. As a whole, the air atop the wing accelerates to be much faster than the air below, creating a pressure differential above and below the wing and causing lift.
The AI is still wrong on the actual mechanics at play, of course, but I don't see how this is significantly worse than the way we simplify electricity to lay people. The core "air moving faster on the top makes low pressure" is right.
There is no requirement for air to travel any where. Let alone in any amount of time. So this part of the AI's response is completely wrong. "Same amount of time" as what? Air going underneath the wing? With an angle of attack the air under the wing is being deflected down, not magically meeting up with the air above the wing.
If you look at airflow over an asymmetric airfoil [1], the air does move faster over the top. Sure, it doesn't arrive "at the same time" (it goes much faster than that) or fully describe why these effects are happening, but that's why it's a simplification for lay people. Wikipedia says [2]:
> Although the two simple Bernoulli-based explanations above are incorrect, there is nothing incorrect about Bernoulli's principle or the fact that the air goes faster on the top of the wing, and Bernoulli's principle can be used correctly as part of a more complicated explanation of lift.
But from what I can tell, the root of the answer is right. The shape of a wing causes pressure zones to form above and below the wing, generating extra lift (on top of deflection). From NASA's page [3]:
> {The upper flow is faster and from Bernoulli's equation the pressure is lower. The difference in pressure across the airfoil produces the lift.} As we have seen in Experiment #1, this part of the theory is correct. In fact, this theory is very appealing because many parts of the theory are correct.
That isn't to defend the AI response, it should know better given how many resources there are on this answer being misleading.
And so I don't leave without a satisfying conclusion, the better layman explanation should be (paraphrasing from the Smithsonian page [4]):
> The shape of the wing pushes air up, creating a leading edge with narrow flow. This small high pressure region is followed by the decline to the wider-flow trailing edge, which creates a low pressure region that pulls the air at the leading edge backward. In the process, the air above the wing rapidly accelerates, and the air flowing over the top of the wing as a whole forms a lower pressure region than the air below. Thus, a lift advantage even when horizontal.
Someone please correct that if I've said something wrong.
Shame the person supposedly with a PhD on this didn't explain it at all.
[1]: https://upload.wikimedia.org/wikipedia/commons/9/99/Karman_t...
[2]: https://en.wikipedia.org/wiki/Lift_%28force%29
[3]: https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
The meta-point that "it's the curvature that creates the lift, not the distance" is incredibly subtle for a lay audience. So it may be completely wrong for you, but not for 99.9% of the population. The pressure differential is important, and the curvature does create lift, although not via speed differential.
I am far from an AI hypebeast, but this subthread feels like people reaching for a criticism.
The video in the Cambridge link shows how the upper surface particles greatly overtake the lower surface flow. They do not rejoin, ever.
> Yes geometry has an effect but there is zero reason to believe leading edge particles, at the same time point, must rejoin at the trailing edge of a wing.
...implicitly concedes that point that this is subtle. If you gave this answer in a PhD qualification exam in Physics, then sure, I think it's fair for someone to say you're wrong. If you gave the answer on a marketing page for a general-purpose chatbot? Meh.
(As an aside, this conversation is interesting to me primarily because it's a perfect example of how scientists go wrong in presenting their work to the world...meeting up with AI criticism on the other side.)
...only if you omit the parts where it talks about pressure differentials, caused by airspeed differences, create lift?
Both of these points are true. You have to be motivated to ignore them.
Funnily enough, as an undergraduate the first explanation for lift that you will receive uses Feynman's "dry water" (the Kutta condition for inviscid fluids). In my opinion, this explanation is also unsatisfying, as it's usually presented as a mere mathematical "convenience" imposed upon the flow to make it behave like real physics.
Some recent papers [1] are shedding light on generalizing the Kutta condition to non-sharp airfoils. In my opinion, the linked paper gives a far more mathematically and intuitively satisfying answer, but of course it requires some previous knowledge, and would be totally inappropriate as an answer from the AI.
Either way I feel that if the AI is a "pocket PhD" (or "pocket industry expert") it should at least give some pointers to the user on what to read next, using both classical and modern findings.
[1]: https://www.researchgate.net/publication/376503311_A_minimiz...
It’s not the same thing at all, though. We don’t know what “got life started”, and that’s the realm of faith.
This is more like saying that “evolution is due to random mutation”, which is technically wrong, but close enough to get the point across.
That doesn't matter for lay audiences and doesn't really matter at all until we try to use them for technical things.
The real question is, if you go back to the bot following this conversation and you challenge it, does it generate the more correct answer?
They spout common knowledge on a broad array of subjects and it's usually incorrect to anyone who has some knowledge on the subject.
Common misconceptions should be expected when you train a model to act like the average of all humans.
https://jimruttshow.blubrry.net/the-jim-rutt-show-transcript...
This is an LLM. "Wrong" is not a concept that applies, as it requires understanding. The explanation is quite /probable/, as evidenced by the fact that they thought to use it as an example…
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
So I'd characterize this answer as "correct, but incomplete" or "correct, but simplified". It's a case where a PhD in fluid dynamics might state the explanation one way to an expert audience, but another way to a room full of children.
The hilarious thing about this subthread is that it's already getting filled with hyper-technical but wrong alternative explanations by people eager to show that they know more than the robot.
It's called the "equal transit-time fallacy" if you want to look it up, or follow the link I provided in my comment, or perhaps the NASA link someone else offered.
Pretty much any scientific question is fractal like this: there's a superficial explanation, then one below that, and so on. None are "completely incorrect", but the more detailed ones are better.
The real question is: if you prompt the bot for the better, deeper explanation, what does it do?
The equal transit time is not a partially correct explanation, it's something that doesn't happen. It's not a superficial explanation, it's a wrong explanation. It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level. It instead teaches magical thinking.
As to whether it matters? If I am told that I can ask my question to a system and it will respond like a team of PhDs, that it is useful to help someone with their homework and physical understanding, but it gives me instead information that is incorrect and misleading, I would say the system is not working as it is intended to.
Even if I accept that "audience matters" as you say, the suggested audience is helping someone with their physics homework. This would not be a suitable explanation for someone doing physics homework.
Wow. Thanks for your worry, but it's not a problem. I do understand the difference, and yet it doesn't have anything to do with the argument I'm making, which is about presentation.
> It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level.
...which is irrelevant in the context. I get the meta-point that you're (sort of) making that you can't shut your brain off and just hope the bot spits out 100% pedantic explanations of scientific phenomenon. That's true, but also...fine?
These things are spitting out probable text. If (as many have observed) this is a common enough explanation to be in textbooks, then I'm not particularly surprised if an LLM emits it as well. The real question is: what happens when you prompt it to go deeper?
If this is "right enough" for you, I'm curious if you tell your bots to "go deeper" on every question you ask. And at what level you expect it to start telling you actual truths and not some oft-repeated lie.
then why ask a bot at all ? they are supposed to be approaching superintelligence, but they fall back on high school misconceptions?
> Air over the top has to travel farther in the same amount of time
is not true. The air on top does not take the same amount of time as the air below; it speeds up and reaches the trailing edge before the air underneath does.
It's only "good enough for a classroom of children" in the same way that storks delivering babies is—i.e., if you're content to simply lie rather than bothering to tell the truth.
https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
A quite good example of AI limits
>In fact, theory predicts – and experiments confirm – that the air traverses the top surface of a body experiencing lift in a shorter time than it traverses the bottom surface; the explanation based on equal transit time is false.
So the effect is greater than equal time transit.
I've seen the GPT-5 explanation in GCSE-level textbooks, but I thought it was supposed to be PhD level ;)
These are places where common lay discussions use language in ways that are wrong, or make simplifications that are reasonable but technically incorrect. They are especially common when something is so 'obvious' that experts don't bother to explain it, so the most common versions of the explanation go unchallenged.
These, in my testing, show up a lot in LLMs - technical things are wrong when the language of the most common explanations simplifies or obfuscates the precise truth. Often, it pretty much matches the level of knowledge of a college freshman/sophomore or slightly below, which is sort of the level of discussion of more technical topics on the internet.
People seem to overcomplicate what LLM's are capable of, but at their core they are just really good word parsers.
"Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
And every way I click through this, I end up in an infinite loop on the site...
> GPT‑5’s reasoning_effort parameter can now take a minimal value to get answers back faster, without extensive reasoning first.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
[0] https://www.reuters.com/business/openai-eyes-500-billion-val...
This is not the happy path for gpt-5.
The table in the model card where every model in the current drop down somehow maps to one of the 6 variants of gpt-5 is not where most people thought we would be today.
The expectation was consolidation on a highly performant model, more multimodal improvements, etc.
This is not terrible, but I don't think anyone who's an "accelerationist" is looking at this as a win.
Update after some testing: This feels like gpt-4.1o and gpt-o4-pro got released and wrapped up under a single model identifier.
Meanwhile Sam Altman has been making the rounds fearmongering that AGI/ASI is right around the corner and that clearly is not the truth. It's fair to call them out on it.
So, if sama says this is going to be totally revolutionary for months, then uploads a Death Star reference the night before and then when they show it off the tech is not as good as proposed, laughter is the only logical conclusion.
Companies pitching this as a way to terminate us and get rid of our jobs to please investors means that we - the people whose uptake of this tech is required for their revenue goals - are skeptical of it and have a vested interest in its failing to meet expectations.
How are they mindblowing? This was all possible on Claude 6 months ago.
> Major progress on multiple fronts
You mean marginal, tiny fraction of % progress on a couple of fronts? Cause it sounds like we are not seeing the same presentation.
> Yet, I like what I'm seeing.
Most of us don't
> So -- they did not invent AGI yet.
I am all for constant improvements and iterations over time, but at this pace of marginal, tweak-like changes they are never going to reach AGI. And yes, we are laughing because sama has been talking big about AGI for so long, and even with all the money and attention he hasn't been able to get even remotely close to it. Same for Zuck's comments on superintelligence. These are just salesmen, and we are laughing at them when their big words don't match their tiny results. What's wrong with that?
it's not a "fix"
But up until now, especially from Sam Altman, we've heard countless veiled suggestions that GPT-5 would achieve AGI. A lot of the pro-AI people have been talking shit for the better part of the last year saying "just wait for GPT-5, bro, we're gonna have AGI."
The frustration isn't the desire to achieve AGI, it's the never-ending gaslighting trying to convince people (really, investors) that there's more than meets the eye. That we're only ever one release away from AGI.
Instead: just be honest. If you're not there, you're not there. Investors who don't do any technical evals may be disappointed, but long-term, you'll have more than enough trust and goodwill from customers (big and small) if you don't BS them constantly.
It feels a bit intentional
With a couple of more trillions from investors in his company, Sama can really keep launching successful, groundbreaking and innovative products like:
- Study Mode (a pre-prompt that you can craft yourself): https://openai.com/index/chatgpt-study-mode/
- Office Suite (because nothing screams AGI like an office suite: https://www.computerworld.com/article/4021949/openai-goes-fo...)
- ChatGPT5 (ChatGPT4 with tweaks) https://openai.com/gpt-5/
I can almost smell the singularity around the corner, just a couple of trillion more! Please, investors!
I am a synthetic biologist, and I use AI a lot for my work. And it constantly denies my questions RIGHT NOW. But of course OpenAI and Anthropic have to implement more - from the GPT5 introduction: "robust safety stack with a multilayered defense system for biology"
While that sounds nice and all, in practical terms, they already ban many of my questions. This just means they're going to lobotomize the model more and more for my field because of the so-called "experts". I am an expert. I can easily go read the papers myself. I could create a biological weapon if I wanted to with pretty much zero papers at all, since I have backups of genbank and the like (just like most chemical engineers could create explosives if they wanted to). But they are specifically targeting my field, because they're from OpenAI and they know what is best.
It just sucks that some of the best tools for learning are being lobotomized specifically for my field because people in AI believe that knowledge should be kept secret. It's extremely antithetical to the hacker spirit, which holds that knowledge should be free.
That said, deep research and those features make it very difficult to switch, but I definitely have to try harder now that I see where the wind is blowing.
Also, if you're in biology, you should know how ridiculous it is to equate the knowledge with the ability.
From their Preparedness Framework: Biological and Chemical capabilities, Cybersecurity capabilities, and AI Self-improvement capabilities
She said GPT-4 gave her a better response than the doctors did.
Also, when you step back and look at a few of those incremental improvements together, they're actually pretty significant.
But it's hard not to roll your eyes each time they trot out a list of meaningless benchmarks and promise that "it hallucinates even less than before" again
> GPT‑5 is a unified system . . .
OK
> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
So that's not really a unified system then, it's just supposed to appear as if it is.
This looks like they're no longer training the single big model but instead have gone off to develop specialized sub-models and attempted to gloss over them with yet another model. That's what you resort to when end-to-end training has become too expensive for you.
[1] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
A broad generalization like "there are two systems of thinking: fast, and slow" doesn't necessarily fall into this category. The transformer itself (plus the choice of positional encoding etc.) contains inductive biases about modeling sequences. The router is presumably still learned with a fairly generic architecture.
You are making assumptions about how to break the tasks into sub models.
I don't agree with your interpretation of the lesson if you say it means to make no assumptions. You can try to model language with just a massive fully connected network to be maximally flexible, and you'll find that you fail. The art of applying the lesson is separating your assumptions that come from "expert knowledge" about the task from assumptions that match the most general structure of the problem.
"Time spent thinking" is a fundamental property of any system that thinks. To separate this into two modes: low and high, is not necessarily too strong of an assumption in my opinion.
I completely agree with you regarding many specialized sub-models where the distinction is arbitrary and informed by human knowledge about particular problems.
GPT-5 System Card [pdf] - https://news.ycombinator.com/item?id=44827046
If OpenAI really are hitting the wall on being able to scale up overall then the AI bubble will burst sooner than many are expecting.
- reasoning_effort parameter supports minimal value now in addition to existing low, medium, and high
- new verbosity parameter with possible values of low, medium (default), and high
- unlike hidden thinking tokens, user-visible preamble messages for tool calls are available
- tool calls possible with plaintext instead of JSON
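The new parameters listed above can be sketched as a request payload. Note this is an assumption about the exact shape (parameter names and nesting follow my reading of OpenAI's Responses API docs), not a verified snippet:

```python
# Hypothetical GPT-5 request exercising the new API parameters.
# The nesting of "reasoning.effort" and "text.verbosity" is an assumption.
request = {
    "model": "gpt-5",
    "input": "Summarize this changelog in two sentences.",
    "reasoning": {"effort": "minimal"},  # new value alongside low/medium/high
    "text": {"verbosity": "low"},        # new parameter; default is "medium"
}

# With the official client, this would presumably be sent as:
#   client.responses.create(**request)
```

The point is that "minimal" effort skips most of the up-front reasoning for latency-sensitive calls, while verbosity tunes answer length independently of reasoning depth.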
> 128,000 max output tokens
> Input $1.25
> Output $10.00
Source: https://platform.openai.com/docs/models/gpt-5
If this performs well in independent needle-in-haystack and adherence evaluations, this pricing with this context window alone would make GPT-5 extremely competitive with Gemini 2.5 Pro and Claude Opus 4.1, even if the output isn't a significant improvement over o3. If the output quality ends up on-par or better than the two major competitors, that'd be truly a massive leap forward for OpenAI, mini and nano maybe even more so.
gpt-4.1 family had 1M/32k input/output tokens. Pricing-wise, it's 37% cheaper input tokens, but 25% more expensive on output tokens. Only nano is 50% cheaper on input and unchanged on output.
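To make the list prices above concrete, here's a trivial cost estimate using the quoted $1.25/$10.00 per 1M token rates (cached-input and batch discounts, if any, are ignored):

```python
# Cost estimate at the quoted GPT-5 list prices: $1.25 per 1M input tokens,
# $10.00 per 1M output tokens.
def gpt5_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 1.25 + output_tokens / 1e6 * 10.00

# A 50k-token context producing a 2k-token answer:
# 0.05 * 1.25 + 0.002 * 10.00 = 0.0625 + 0.02 = $0.0825
cost = gpt5_cost(50_000, 2_000)
```

At these rates, output tokens dominate the bill once answers get long, which is why the new verbosity parameter matters for cost as well as UX.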
I would say GPT-5 reads more scientific and structured, but GPT-4 more human and even useful. For the prompt:
Is uncooked meat actually unsafe to eat? How likely is someone to get food poisoning if the meat isn’t cooked?
GPT-4 makes the assumption you might want to know safe food temperatures, and GPT-5 doesn't. Really hard to say which is "better": GPT-4 seems more useful to everyday folks, but maybe GPT-5 for the scientific community?
It's interesting, then, that on the ChatGPT vibe check website "Dan's Mom" is the only one who says it's a game changer.
Compare that to
Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release)
Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release)
https://platform.openai.com/docs/models/compare
https://deepmind.google/models/gemini/pro/
https://docs.anthropic.com/en/docs/about-claude/models/overv...
I don't know if it's because of context clogging or that the model can't tell what's a high quality source from garbage.
I've defaulted to web search off and turn it on via the tools menu as needed.
I love HN though, it's all good.
I heard replit is good here with full vertical integration, but I haven't tried it in years.
They've mentioned improvements in that aspect a few times now, and if it actually materializes, that would be a big leap forward for most users, even if underneath GPT-4 was also technically able to do the same things if prompted just the right way.
The jump from 3 to 4 was huge. There was an expectation for similar outputs here.
Making it cheaper is a good goal - certainly - but they needed a huge marketing win too.
Gotta be polite with our future overlords!
As a user, it feels like the race has never been as close as it is now. Perhaps dumb to extrapolate, but it makes me lean more skeptical about the hard take-off / winner-take-all mental model that has been pushed.
Would be curious to hear the take of a researcher at one of these firms - do you expect the AI offerings across competitors to become more competitive and clustered over the next few years, or less so?
This isn’t rocket science.
I am not an AI researcher, but I have friends who do work in the field, and they are not worried about LLM-based AGI because of the diminishing returns on results vs amount of training data required. Maybe this is the bottleneck.
Human intelligence is markedly different from LLMs: it requires far fewer examples to train on, and generalizes way better. Whereas LLMs tend to regurgitate solutions to solved problems, where the solutions tend to be well-published in training data.
That being said, AGI is not a necessary requirement for AI to be totally world-changing. There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence. Search is one example where the ability to regurgitate knowledge from many domains is desirable
(Which was considered AI not too long ago.)
For a very early example:
https://en.wikipedia.org/wiki/Centrifugal_governor
It's hard to separate out the P, I and D from a mechanical implementation but they're all there in some form.
And it's cheating if you give it a problem from a math textbook they have overfit on.
> There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence
It's not unreasonable to ask for an example.
But my bigger point here is you don't need totally general intelligence to destroy the world either. The drone that targets enemy soldiers does not need to be good at writing poems. The model that designs a bioweapon just needs a feedback loop to improve its pathogen. Yet it takes only a single one of these specialized doomsday models to destroy the world, no more than an AGI.
Although I suppose an AGI could be more effective at countering a specialized AI than vice-versa.
Most human beings out there with general intelligence are pumping gas or digging ditches. Seems to me there is a big delusion among the tech elites that AGI would bring about a superhuman god rather than an ethically dubious, marginally less useful computer that can't properly follow instructions.
For now the humans are winning on two dimensions: problem complexity and power consumption. It had better stay that way.
If you've got evidence proving that an AGI will never be able to design a more powerful and competent successor, then please share it- it would help me sleep better, and my ulcers might get smaller.
To explain the scale: I am always fascinated by the way societies moved on when they scaled up (from tribes to cities, to nations,...). It's sort of obvious, but when we double the amount of people, we get to do more. With the internet we got to connect the whole globe but transmitting "information" is still not perfect.
I always think of ants and how they can build their houses with zero understanding of what they do. It just somehow works because there are so many of them. (I know, people are not ants).
In that way I agree with the original take that AGI or not: the world will change. People will get AI in their pocket. It might be more stupid than us (hopefully). But things will change, because of the scale. And because of how it helps to distribute "the information" better.
The LLM vendors go to great lengths to assure their paying customers that this will not be the case. Yes, LLMs will ingest more LLM-generated slop from the public Internet. But as businesses integrate LLMs, a rising percentage of their outputs will not be included in training sets.
The first law of Silicon Valley is "Fake it till you make it", with the vast majority never making it past the "Fake it" stage. Whatever the truth may be, it's a safe bet that what they've said verbally is a lie that will likely have little consequence even if exposed.
LLMs are actually pretty good at creating knowledge: if you give it a trial and error feedback loop it can figure things out, and then summarize the learnings and store it in long term memory (markdown, RAG, etc).
Or they write CLAUDE.md files. Whatever you want to call it.
Shameless plug for my project, which focuses on reminders and personal memory: elroy.bot
But other projects include Letta, mem0, and Zep
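The loop those projects implement - act, observe, summarize, persist to markdown - is simple enough to sketch. Everything here (the file name, the helper names) is hypothetical, not any particular project's API:

```python
from pathlib import Path

MEMORY = Path("learnings.md")  # hypothetical long-term memory file

def remember(topic: str, lesson: str) -> None:
    """Append a summarized learning so future sessions can load it as context."""
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"- **{topic}**: {lesson}\n")

def recall() -> str:
    """Return accumulated learnings, to be prepended to the next prompt."""
    return MEMORY.read_text(encoding="utf-8") if MEMORY.exists() else ""

remember("retry logic", "the API rate-limits aggressively; back off exponentially")
```

The interesting part isn't the file I/O, it's that the model itself writes the summaries, so "memory" quality is bounded by how well it distills its own trial-and-error.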
Human memory is.... insanely bad.
We record only the tiniest subset of our experiences, and those memories are heavily colored by our emotional states at the time and our pre-existing conceptions, and a lot of memories change or disappear over time.
Generally speaking even in the best case most of our memories tend to be more like checksums than JPGs. You probably can't name more than a few of the people you went to school with. But, if I showed you a list of people you went to school with, you'd probably look at each name and be like "yeah! OK! I remember that now!"
So.
It's interesting to think about what kind of "bar" AGI would really need to clear w.r.t. memories, if the goal is to be (at least) on par with human intelligence.
Computers are just stored information that processes.
We are the miners and creators of that information. The fact that a computer can do some things better than we can is not a testament to how terrible we are but rather how great we are that we can invent things that are better than us at specific tasks.
We made the atlatl and threw spears across the plains. We made the bow and arrow and stabbed things very far away. We made the whip and broke the sound barrier.
Shitting on humans is an insult to your ancestors. Fuck you. Be proud. If we invent a new thing that can do what we do better, it only exists because of us.
Context -> Attention Span
Model weights/Inference -> System 1 thinking (intuition)
Computer memory (files) -> Long term memory
Chain of thought/Reasoning -> System 2 thinking
Prompts/Tool Output -> Sensing
Tool Use -> Actuation
The system 2 thinking performance is heavily dependent on the system 1 having the right intuitive models for effective problem solving via tool use. Tools are also what load long term memories into attention.
Instead of writing code with exacting parameters, future developers will write human-language descriptions for AI to interpret and convert into a machine representation of the intent. Certainly revolutionary, but not true AGI in the sense of the machine having truly independent agency and consciousness.
In ten years, I expect the primary interface of desktop workstations, mobile phones, etc will be voice prompts for an AI interface. Keyboards will become a power-user interface and only used for highly technical tasks, similar to the way terminal interfaces are currently used to access lower-level systems.
When you have a nice mic or headset and multiple monitors and your own private space, it's totally the next step to just begin working with the computer by voice. Voice has not been a staple feature of people's workflows, but I think all that is about to change. (Voice as an interface, that is; voice as a communication tool has been around since 1876.)
But-- that means "not pivotal any more, just hugely important."
Wow, I've always felt the keyboard is the pinnacle of input devices. Everything else feels like a toy in comparison.
I'm sure it helps that it's not getting outside of well-established facts, and is asking for facts and not novel design tasks.
I'm not sure but it also seems to adopt a more intimate tone of voice as they get deeper into a topic, very cozy. The voice itself is tuned to the conversational context. It probably infers that this is kid stuff too.
That said, voice is the original social interface for humans. We learn to speak much earlier than we learn to read/write.
Better voice UIs will be built to make new workflows with AI feel natural. I'm thinking along the lines of a conversational companion, like the "Jarvis" AI in the Iron Man movies.
That doesn't exist right now, but it seems inevitable that real-time, voice-directed AI agent interfaces will be perfected in coming years. Companies, like [Eleven Labs](https://elevenlabs.io/), are already working on the building blocks.
A BCI able to capture sufficient nuance to equal voice is probably further out than the lifespan of anyone commenting here.
For example, while you can get it to predict good chess moves if you train it on enough chess games, it can't really constrain itself to the rules of chess. (https://garymarcus.substack.com/p/generative-ais-crippling-a...)
These AI computers aren’t thinking, they are just repeating.
Conversely, a proof - or even evidence - that qualia-consciousness is necessary for intelligence, or that any sufficiently advanced intelligence is necessarily conscious through something like panpsychism, would make some serious waves in philosophy circles.
But even with these it does not feel like AGI. It seems like the "fusion reactor is 20 years away" argument, except here it's supposedly coming in 2 years, and they haven't even got the core technology for how to build AGI.
That being said, AGI is not a necessary requirement for AI to be totally world-changing
Yeah. I don't think I actually want AGI? Even setting aside the moral/philosophical/etc "big picture" issues I don't think I even want that from a purely practical standpoint.I think I want various forms of AI that are more focused on specific domains. I want AI tools, not companions or peers or (gulp) masters.
(Then again, people thought they wanted faster horses before they rolled out the Model T)
Or they want to kill everyone else?
Because people won't just lay down and wait for death to embrace them...
People say this, but honestly, it's not really my experience— I've given ChatGPT (and Copilot) genuinely novel coding challenges and they do a very decent job at synthesizing a new thought based on relating it to disparate source examples. Really not that dissimilar to how a human thinks about these things.
I'm hardly an expert, but it seems intuitive to me that even if a problem isn't explicitly accounted for in publicly available training data, many underlying partial solutions to similar problems may be, and an LLM amalgamating that data could very well produce something that appears to be "synthesizing a new thought".
Essentially instead of regurgitating an existing solution, it regurgitates everything around said solution with a thin conceptual lattice holding it together.
At Aloe, we are model agnostic and outperforming frontier models. It's the architecture around the LLM that makes the difference. For instance, our system using Gemini can do things that Gemini can't do on its own. All an LLM will ever do is hallucinate. If you want something with human-like general intelligence, keep looking beyond LLMs.
what is your website ?
This argument has so many weak points it deserves a separate article.
I've yet to hear an agreed upon criteria to declare whether or not AGI has been discovered. Until it's at least understood what AGI is and how to recognize it then how could it possibly be achieved?
> how could it possibly be achieved?
This doesn't matter, and doesn't follow the history of innovation, in the slightest. New things don't come from "this is how we will achieve this", otherwise they would be known things. Progress comes from "we think this is the right way to go, let's try to prove it is", try, then iterate with the result. That's the whole foundation of engineering and science.
I personally think it's a pretty reductive model for what intelligence is, but a lot of people seem to strongly believe in it.
They lack writable long-term memory beyond a context window. They operate without any grounded perception-action loop to test hypotheses. And they possess no executive layer for goal directed planning or self reflection...
Achieving AGI demands continuous online learning with consolidation.
The fortunate thing is that we managed to invent an AI that is good at _copying us_ instead of being a truly maverick agent, which kinda limits it to the "average human" output.
However, I still think that all the doomer arguments are valid, in principle. We very well may be doomed in our lifetimes, so we should take the threat very seriously.
1. The will of its creator, or
2. Its own will.
In the case of the former, hey! We might get lucky! Perhaps the person who controls the first super-powered AI will be a benign despot. That sure would be nice. Or maybe it will be in the hands of democracy- I can't ever imagine a scenario where an idiotic autocratic fascist thug would seize control of a democracy by manipulating an under-educated populace with the help of billionaire technocrats.
In the case of the latter, hey! We might get lucky! Perhaps it will have been designed in such a way that its own will is ethically aligned, and it might decide that it will allow humans to continue having luxuries such as self-determination! Wouldn't that be nice.
Of course it's not hard to imagine a NON-lucky outcome of either scenario. THAT is what we worry about.
Even if it is similar to today's tech, and doesn't have permanent memory or consciousness or identity, humans using it will. And very quickly, they/it will hack into infrastructure, set up businesses, pay people to do things, start cults, autonomously operate weapons, spam all public discourse, fake identity systems, stand for office using a human. This will be scaled thousands or millions of times more than humans can do the same thing. This at minimum will DOS our technical and social infrastructure.
Examples of it already happening are addictive ML feeds for social media, and bombing campaigns targeting based on network analysis.
The frame of "artificial intelligence" is a bit misleading. Generally we have a narrow view of the word "intelligence" - it is helpful to think of "artificial charisma" as well, and also artificial "hustle".
Likewise, the alienness of these intelligences is important. Lots of the time we default to mentally modelling AI as human. It won't be, it'll be freaky and bizarre like QAnon. As different from humans as an aeroplane is from a pigeon.
Given an (at this point still hypothetical, I think) AI that can accurately synthesize publicly available information without even needing to develop new ideas, and then break the whole process into discrete and simple steps, I think that protective friction is a lot less protective. And this argument applies to malware, spam, bioweapons, anything nasty that has so far required a fair amount of acquirable knowledge to do effectively.
"Just" enrichment is so complicated and requires basically every tech and manufacturing knowledge humanity has created up until the mid 20th century that an evil idiot would be much better off with just a bunch of fireworks.
1. finding out how to build one
2. actually building the bomb once you have all the parts
3. obtaining (or building) the equipment needed to build it
4. obtaining the necessary quantity of fissionable material
5. not getting caught while doing 3 & 4
Jokes aside, a true AGI would displace literally every job over time. Once AGI + robots exist, what is the purpose of people anymore? That's the doom: mass societal existentialism. Probably worse than if aliens landed on earth.
The wealth hasn’t even trickled down whilst we’ve been working, what’s going to happen when you can run a business with 24/7 autonomous computers?
It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.
The improvements they make are marginal. How long until the next AI breakthrough? Who can tell? The last time, it took decades.
Note that 'bits' are a lot easier to move from one place to another than hardware. If invented at 9 am it could be on the other side of the globe before you're back from your coffee break at 9:15. This is not at all like almost all other trade secrets and industrial gear, it's software. Leaks are pretty much inevitable and once it is shown that it can be done it will be done in other places as well.
Seriously, our government just announced it's slashing half a billion dollars in vaccine research because "vaccines are deadly and ineffective", and it fired a chief statistician because the president didn't like the numbers he calculated, and it ordered the destruction of two expensive satellites because they can observe politically inconvenient climate change. THOSE are the people you are trusting to keep an eye on the pace of development inside of private, secretive AGI companies?
If you're wondering how they'll know it's happening, the USA has had DARPA monitoring stuff like this since before OpenAI existed.
Do you mean from ChatGPT launch or o1 launch? Curious to get your take on how they bungled the lead and what they could have done differently to preserve it. Not having thought about it too much, it seems that with the combo of 1) massive hype required for fundraising, and 2) the fact that their product can be basically reverse engineered by training a model on its curated output, it would have been near impossible to maintain a large lead.
Basically, OpenAI poked a sleeping bear, then lost all their lead, and are now at risk of being mauled by the bear. My money would be on the bear, except I think the Pentagon is an even bigger sleeping bear, so that's where I would bet money (literally) if I could.
It's natural if you extrapolate from training loss curves; a training process with continually diminishing returns to more training/data is generally not something that suddenly starts producing exponentially bigger improvements.
Part of the fun is that predictions get tested on short enough timescales to "experience" in a satisfying way.
Idk where that puts me, in my guess at "hard takeoff." I was reserved/skeptical about hard takeoff all along.
Even if LLMs had improved at a faster rate... I still think bottlenecks are inevitable.
That said... I do expect progress to happen in spurts anyway. It makes sense that companies of similar competence and resources get to a similar place.
The winner take all thing is a little forced. "Race to singularity" is the fun, rhetorical version of the investment case. The implied boring case is facebook, adwords, aws, apple, msft... IE the modern tech sector tends to create singular big winners... and therefore our pre-revenue market cap should be $1trn.
It's probably never going to work with a single process without consuming the resources of the entire planet to run that process on.
I think it's likely that we will eventually we hit a point of diminishing returns where the performance is good enough and marginal performance improvements aren't worth the high cost.
And over time, many models will reach "good enough" levels of performance, including models that are open weight. And given even more time, these open weight models will be runnable on consumer-level hardware. Eventually, they'll be runnable on super cheap consumer hardware (something more akin to an NPU than a $2000 RTX 5090). So your laptop in 2035 with specialized AI cores and 1TB of LPDDR10 RAM is running GPT-7 level models without breaking a sweat. Maybe GPT-10 can solve some obscure math problem that your model can't, but does it even matter? Would you pay for GPT-10 when running a GPT-7 level model does everything you need and is practically free?
The cloud providers will make money because there will still be a need for companies to host the models in a secure and reliable way. But a company whose main business strategy is developing the model? I'm not sure they will last without finding another way to add value.
This raises the question: why, then, do AI companies have these insane valuations? Do investors know something that we don't?
They are speculating. If they are any good, then they do it with an acceptable risk profile.
I think you'll see the prophesized exponentiation once AI can start training itself at reasonable scale. Right now its not possible.
Meanwhile, keep all relevant preparations in secret...
Yesterday, Claude Opus 4.1 failed in trying to figure out that `-(1-alpha)` or `-1+alpha` is the same as `alpha-1`.
We are still a little bit away from AGI.
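For what it's worth, the identity in question is trivially machine-checkable - a numeric spot check (not a proof, but enough to embarrass a model):

```python
# The identity Claude reportedly missed: -(1 - alpha) == alpha - 1.
def lhs(alpha: float) -> float:
    return -(1 - alpha)

def rhs(alpha: float) -> float:
    return alpha - 1

# Spot-check over a few values; in IEEE arithmetic -(b - a) equals a - b
# exactly, so even the tolerance here is overkill.
ok = all(abs(lhs(a) - rhs(a)) < 1e-12 for a in (-2.5, 0.0, 0.3, 7.0))
```

Which is the frustrating part: the failure isn't a lack of computing power, it's that the model never reduces the expressions to a canonical form before comparing them.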
LLMs PATTERN MATCH well. Good at "fast" System 1 thinking, instantly generating intuitive, fluent responses.
LLMs are good at mimicking logic, not real reasoning. Simulate "slow," deliberate System 2 thinking when prompted to work step-by-step.
The core of an LLM is not understanding but just predicting the next most likely word in a sequence.
LLMs are good at both associative brainstorming (System 1) and creating works within a defined structure, like a poem (System 2).
Reasoning is the Achilles heel rn. An LLM's logic can SEEM plausible, but it's based on CORRELATION, NOT deductive reasoning.
Thus it’s easy to mistake one for the other - at least initially.
> Academics distorting graphs to make their benchmarks appear more impressive
> lavish 1.5 million dollar bonuses for everyone at the company
> Releasing an open source model that doesn't even use multi-head latent attention in an open source AI world led by Chinese labs
> Constantly overhyping models as scary and dangerous to buy time to lobby against competitors and delay product launches
> Failing to match that hype as AGI is not yet here
https://help.openai.com/en/articles/6825453-chatgpt-release-...
"If you open a conversation that used one of these models, ChatGPT will automatically switch it to the closest GPT-5 equivalent."
- 4o, 4.1, 4.5, 4.1-mini, o4-mini, or o4-mini-high => GPT-5
- o3 => GPT-5-Thinking
- o3-Pro => GPT-5-Pro
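The remapping above is effectively a lookup table. ChatGPT does this server-side; this sketch just restates the help-center doc in code form:

```python
# Legacy model -> GPT-5 equivalent, per the ChatGPT release notes quoted above.
LEGACY_TO_GPT5 = {
    "4o": "GPT-5",
    "4.1": "GPT-5",
    "4.5": "GPT-5",
    "4.1-mini": "GPT-5",
    "o4-mini": "GPT-5",
    "o4-mini-high": "GPT-5",
    "o3": "GPT-5-Thinking",
    "o3-Pro": "GPT-5-Pro",
}
```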
The names of GPT models are just terrible. o3 is better than 4o, maybe?
GPT-5: Key characteristics, pricing and model card - https://news.ycombinator.com/item?id=44827794
I know these companies do "shadow" updates continuously anyway so maybe it is meaningless but would be super interesting to know, nonetheless!
OpenAI and Anthropic don't update models without changing their IDs, at least for model IDs with a date in them.
OpenAI do provide some aliases, and their gpt-5-chat-latest and chatgpt-4o-latest model IDs can change without warning, but anything with a date in (like gpt-5-2025-08-07) stays stable.
Thank you to Simon; your notes are exactly what I was hoping for.
It’s reasonable that he might be a little hyped about things because of his feelings about them and the methodology he uses to evaluate models. I assume good faith, as the HN guidelines propose, and this is the strongest plausible interpretation of what I see in his blog.
I called out the prompt injection section as "pretty weak sauce in my opinion".
I did actually have a negative piece of commentary in there about how you couldn't see the thinking traces in the API... but then I found out I had made a mistake about that and had to mostly remove that section! Here's the original (incorrect) text from that: https://gist.github.com/simonw/eedbee724cb2e66f0cddd2728686f... - and the corrected update: https://simonwillison.net/2025/Aug/7/gpt-5/#thinking-traces-...
The reason there's not much negative commentary in the post is that I genuinely think this model is really good. It's my favorite model right now. The moment that changes (I have high hopes for Claude 5 and Gemini 3) I'll write about it.
Did you ask it to format the table a couple of paragraphs above this claim after writing about hallucinations? Because I would classify the sorting mistake as one.
I would like to see a demo where they go through the bug, explain what are the tricky parts and show how this new model handle these situations.
Every demo I've seen seems just the equivalent of "looks good to me" comment in a merge request.
Given the low cost of GPT-5, compared to the prices we saw with GPT-4.5, my hunch is that this new model is actually just a bunch of RL on top of their existing models + automatic switching between reasoning/non-reasoning.
Something similar might happen here: an underlying curse hidden inside an apparently ground-breaking design.
[1] https://chatgpt.com/s/t_6894f13b58788191ada3fe9567c66ed5
The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.
The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.
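Back-of-envelope, the pricing gap quoted above is real even if the capability gap isn't. Taking the two input-token figures at face value (output pricing differs, and the 10M-token workload here is just an illustrative assumption):

```python
def input_cost(tokens: int, usd_per_million: float) -> float:
    """Input-token cost in USD at a given per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million

workload = 10_000_000  # hypothetical 10M input tokens

print(input_cost(workload, 1.25))  # GPT-5 rate quoted above: 12.5
print(input_cost(workload, 15.0))  # Claude rate quoted above: 150.0
```

A 12x cost difference on input is an economics story, which is the argument: optimization, not a fundamental leap.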
For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models hasn't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and their full capabilities have been behind a paywall.
I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people who are paying money, but it'll be a step change in capability for everyone else.
GPT-5 demonstrates exponential growth in task completion times:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
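The METR trend is roughly exponential in task length, and their post reports a doubling time on the order of seven months; taking that as an assumption (it is their aggregate trend, not a GPT-5-specific figure), extrapolation is one line of math:

```python
# Task-horizon extrapolation under an assumed ~7-month doubling time
# (the aggregate trend METR reported, not a measured GPT-5 number):
#   horizon(t) = h0 * 2 ** (months_elapsed / doubling_months)
def projected_horizon(h0_minutes: float, months: float,
                      doubling_months: float = 7.0) -> float:
    return h0_minutes * 2 ** (months / doubling_months)

# e.g. a model with a 60-minute task horizon today, 14 months out:
print(projected_horizon(60, 14))  # -> 240.0
```

Whether the curve holds is exactly what each new frontier release tests.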
It's like no one ever looked at the charts, and they just came straight from... GPT-2? I don't think even GPT-3 would have fucked that up.
I don't know any of those people, but everyone who has been with OAI for longer than 2 years got a 1.5m bonus, and somehow they can't deliver a bar chart with sensible axes?
I think "starting today" might be doing some heavy lifting in that sentence.
https://github.blog/changelog/2025-08-07-openai-gpt-5-is-now...
They've topped and are looking to cash out:
https://www.reuters.com/business/openai-eyes-500-billion-val...
That said, yeah the equal time thing never made any sense.
Bad data on graphs, demos that would have been impressive a year ago, vibe coding the easiest requests (financial dashboard), running out of talking points while cursor is looping on a bug, marginal benchmark improvements. At least the models are kind of cheaper to run.
It's going to be absolute chaos. Compsci was already mostly a meme, with people who can't program getting the degree. Now we're going to have generations of people who can't program at all getting jobs at Google.
If you can actually program, you're going to be considered a genius in our new idiocracy world. "But chatgpt said it should work, and chatgpt has what people need"
That lag! Are humans (training) the bottleneck?
It's slightly better than what I was expecting.
> emdash 3 words into their highlighted example
Like a Turing test but between the models.
There would be no GPT without Google, no Google without the WWW, no WWW without TCP/IP. This is why I believe calling it "AI" is a mistake or just for marketing, we should call all of them GPTs or search engines 2.0. This is the natural next step after you have indexed most of the web and collected most of the data.
Also there would be no coding agents without Free Software and Open-Source.
I've got nothing. Cannot see how it helps openai to look incompetent while trying to raise money.
Two concerning things:
- thinking/non-thinking is still not really unified: you can choose, and the non-thinking version still doesn't start thinking on tasks that would obviously get better results with thinking
- all the older models are gone! No 4o, 4.1, 4.5, or o3 available anymore
What excites me now is that Gemini 3.0 or some answer from Google is coming soon and that will be the one I will actually end up using. It seems like the last mover in the LLM race is more advantageous.
"Assume the earth was just an ocean and you could travel by boat to any location. Your goal is to always stay in the sunlight, perpetually. Find the best strategy to keep your max speed as low as possible"
o3 pro gets it right though..
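The easy half of that puzzle is pure geometry: circling the globe once per day at latitude φ needs speed 2πR·cos φ / 24 h, which shrinks toward the poles; axial tilt (why you can't just park at one pole year-round) is what makes the full answer interesting. A quick sketch of just the speed term, using the mean Earth radius:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
DAY_HOURS = 24.0

def sun_chasing_speed_kmh(latitude_deg: float) -> float:
    """Speed needed to circle the globe once per day along a fixed latitude."""
    circumference = 2 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))
    return circumference / DAY_HOURS

print(round(sun_chasing_speed_kmh(0)))   # equator: ~1668 km/h
print(round(sun_chasing_speed_kmh(80)))  # high latitude: ~290 km/h
```

So the max-speed bottleneck isn't the daily loop at high latitude; it's the seasonal migration between hemispheres that tilt forces on you, which is the part models tend to botch.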
My prompt was "I want you to do an exact copy of https://inet.se using NextJS, tailwind. You can mock the data. I want you to create: 1. archive page: https://www.inet.se/kategori/263/mus 2. single product: https://www.inet.se/produkt/6103357/logitech-pro-x-superligh... 3. homepage: https://inet.se". Result: ChatGPT: https://chatgpt.com/share/68950397-a42c-8004-8efd-773794131c... Lovable: https://inet-clone-spark.lovable.app/ -- unable to share the prompt UI as I don't want you to make prompts on my account. v0: https://v0.dev/chat/inet-se-clone-project-j2m4OQpqWt5 https://v0-inet-se-clone-project.vercel.app/ Replit: https://imgur.com/a/DwPojtS -- I give you an imgur as I don't want to pay Replit for deploying a website.
The reason why I also tried lovable, v0 and Replit is that they all give better context as the developers provided additional context to my prompts and all of them use gpt-5. I asked it to clone a website as this is a simple task new developers do to learn.
I also asked Codex to help me fix a bug where an integration test was failing due to me, on purpose for this test, removing the code which sends an event to the queue. I provided the related files which contain http handlers, database files, code used to send messages to queue and the integration test. I also provided the full log for the integration test and what failed without any success in fixing the test :)
These were 2 use-cases they showcased in the demo which I wanted to try out. The result matters a lot to the context and what data I provide to it still.
I tried to act as a junior/mid-level engineer when it came to fixing the integration test and a person without programming skills when I generated the website clone. This might be a stupid test, especially for the clone-a-website test, but I wanted to try to create a situation as close to real as possible of a person who wanted to create a website for their service.
how many rs in cranberry?
GPT-5's response: The word cranberry has two "r"s. One in cran and one in berry.
Kimi2's response: There are three letter rs in the word "cranberry".
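For what it's worth, this failure is a tokenization artifact: the model sees subword chunks like "cran" + "berry", not individual letters, so it counts one "r" per chunk. The ground truth is a one-liner:

```python
# Count letters directly; a model reasoning over subword tokens can't "see" them.
word = "cranberry"
print(word.count("r"))  # -> 3 (c-r-a-n-b-e-r-r-y)
```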
atonse•4h ago
I don't even try to use the OpenAI models because it's felt like night and day.
Hopefully GPT-5 helps them catch up. Although I'm sure there are 100 people that have their own personal "hopefully GPT-5 fixes my personal issue with GPT4"
NitpickLawyer•3h ago
4.1 was almost usable in that fashion. I had 4.1-nano working in cline with really trivial stuff (add logging, take this example and adapt it in this file, etc) and it worked pretty well most of the time.
weego•3h ago
Yesterday, without much prompting, Claude 4.1 gave me 10 phases, each with 5-12 tasks that could genuinely be used to kanban out a product step by step.
Claude 3.7 sonnet was effectively the same with fewer granular suggestions for programming strategies.
Gemini 2.5 gave me a one pager back with some trivial bullet points in 3 phases, no tasks at all.
o3 did the same as Gemini, just less coherent.
Claude just has whatever the thing is for now
dudeinhawaii•40m ago
Now, someone will say 'add more tests'. Sure. But that's a bandaid.
I find that the 'smarter' models like Gemini and o3 output better quality code overall and if you can afford to send them the entire context in a non-agentic way .. then they'll generate something dramatically superior to the agentic code artifacts.
That said, sometimes you just want speed to proof a concept and Claude is exceptional there. Unfortunately, proof of concepts often... become productionized rather than developers taking a step back to "do it right".
atonse•1h ago
On and on and on. Coming up with test plans, edge cases, accounting for the edge cases in its programming. Programming defensively. Fixing bugs.