* Meeting notes that were read accurately from a handwritten note (impressive!), but the summary hallucinated information that was completely made up.
* Running complex PyTorch benchmarks while getting the simple parts completely wrong. We're talking about variants of y = f(wx + b), which is what was being compared. All the graphs and visualizations look very convincing, but the details of what's actually being tested are completely bonkers.
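For reference, the kind of comparison being described is roughly this. A minimal PyTorch sketch; the shapes, activations, and timing loop are illustrative, not the commenter's actual benchmark:

```python
# Minimal sketch of comparing y = f(Wx + b) variants; sizes and the set of
# activations are made up for illustration.
import time
import torch
import torch.nn as nn

x = torch.randn(4096, 1024)
linear = nn.Linear(1024, 1024)   # computes Wx + b
variants = {
    "identity": nn.Identity(),   # y = Wx + b
    "relu": nn.ReLU(),           # y = relu(Wx + b)
    "tanh": nn.Tanh(),           # y = tanh(Wx + b)
}

for name, f in variants.items():
    start = time.perf_counter()
    for _ in range(100):
        y = f(linear(x))
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```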
Is there a petition to bring o3 back? Please? At least it was obvious when it failed.
Womp womp. Frustrating.
Then ‘roll back’ to the real version - but only for paid users.
Imagine how much worse it’d have gone if they called it GPT-4o lite and gave that to free users only and kept 4o for paid only.
Maybe it will make more people subscribe?
But it will make people cancel their subs too - I miss o3
The same issue exists with a bunch of other types of image output from ChatGPT - graphs, schematics, organizational charts, etc. It's been getting better at generating images which look like the type of image you requested, but the accuracy of the contents hasn't kept up.
Reporters should. Or else they're not doing their jobs.
ChatGPT's image generation was not introduced as part of the GPT-5 model release (except SVG generation).
The article leads with "The latest ChatGPT [...] can’t even label a map".
Yes, ChatGPT's image gen has uncanny valley issues, but OpenAI's GPT-5 product release post says nothing about image generation; it only mentions image analysis [1].
As far as I can tell, GPT-Image-1 [2], which was released around March, is what powers the image generation used by ChatGPT. They introduced it as "4o Image Generation" [3], which suggests to me that GPT-Image-1 is a version of the old GPT-4o.
The GPT-5 System card also only mentions image analysis, not generation. [4]
In the OpenAI live stream they said as much. CNN could have checked and made it clear the features are from the much earlier release, but instead they led with a misleading headline.
It's very true that OpenAI doesn't make it obvious how the image generation works though.
[1] https://openai.com/index/introducing-gpt-5/
[2] https://platform.openai.com/docs/models/gpt-image-1
[3] https://openai.com/index/introducing-4o-image-generation/
[4] https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb...
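For what it's worth, gpt-image-1 is exposed as its own model in the public API, separate from whichever chat model you're talking to. A minimal sketch (the prompt and file handling are illustrative):

```python
# gpt-image-1 is called via the Images API as its own model, independent of
# the chat model (GPT-4o, GPT-5, ...) driving the conversation.
import base64
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="gpt-image-1",
    prompt="A map of the United States with each state labeled",
    size="1024x1024",
)
with open("map.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```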
As an aside, ChatGPT has always been "overconfident" in the capabilities of its associated image model. It'll frequently offer to generate images which exceed its ability to execute, or which would need to be based on information which it doesn't know. Perhaps OpenAI developers need to place more emphasis on knowing when to refuse unrealistic image generation requests?
Another helpful intervention point is after gpt-image-1 has produced an image: the model can do a self-review to detect problems post-generation, but it's still not very thorough. That said, OpenAI deliberately keeps its teams small and focused, and everything changes fast; they'll probably ship gpt-image-2 or something else soon anyway.
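A sketch of what that kind of post-generation self-review could look like, assuming the public API (gpt-image-1 for generation, a vision-capable chat model as the reviewer). The prompts, models, and retry logic are invented to illustrate the idea, not OpenAI's actual pipeline:

```python
# Hypothetical generate-then-review loop: produce an image, ask a vision
# model whether it matches the request, retry with the reviewer's feedback.
import base64
from openai import OpenAI

client = OpenAI()
prompt = "A map of the US west coast with each state labeled"

for attempt in range(3):
    img = client.images.generate(model="gpt-image-1", prompt=prompt)
    b64 = img.data[0].b64_json

    review = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable reviewer model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does this image correctly satisfy the request '{prompt}'? "
                         "Reply PASS or FAIL, then list any wrong or missing labels."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    ).choices[0].message.content

    if review.strip().upper().startswith("PASS"):
        break
    # Fold the reviewer's complaints back into the prompt and try again.
    prompt = f"{prompt}. Fix these problems from a previous attempt: {review}"
```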
In a way, reliable prediction is the main job OpenAI has to solve, and always has been. Some researchers say the standard way these models are trained produces "entangled representations", which makes them unreliable. They also suffer from the "Reversal Curse" (a model trained on "A is B" often fails to infer "B is A"). Maybe if they fix these issues, it becomes real AGI and ASI all in one go?
I think their principal mistake was in conflating the introduction of GPT-5 with the model-selection heuristics they started using at the same time. Whatever empirical hacks they came up with to determine how much thinking should be applied to a given prompt are not working well. Then there's the immediate-but-not-really deprecation of the other models. It should have been very clear that the image-based tests that the CNN reporter referred to were not running on GPT-5 at all. But it wasn't, and that's a big marketing communications failure on OpenAI's part.
One of several, for anyone who sat through their presentation.
I've done a couple of experiments now, and with effort I can get an LLM to produce code that's not horrible and mostly functional. (I've been trying to implement algorithms from CS papers that don't link to code.) I've observed that once you discover the magic words the LLM wants and give sufficient background in the conversation history, it can do OK.
But, for me anyway, the process of uncovering the magic words is slower than just writing the code myself. Although that could be because I'm targeting toy examples that aren't large codebases and aren't what shows up in the typical internet coding demo.
The limitations of what many believed to be a path to AGI/ASI are becoming clearly apparent.
It's difficult to say how much room for improvement there is, or to give a definite answer about the usefulness and economic impact of these models, but what we're seeing now is not exponential improvement.
This is not going to rewrite and improve itself, cure cancer, unify physics, or produce any other kind of scientific or technological breakthrough.
For coders it is merely a dispensable QoL improvement.
Careful. I too am pessimistic about the generative AI hype, but you seem even more so, to the point where it's making you biased and possibly uninformed.
Today’s news from BBC, 6 hours ago. “AI designs antibiotics for gonorrhoea and MRSA superbugs”
https://www.bbc.com/news/articles/cgr94xxye2lo
> Now, the MIT team have gone one step further by using *generative AI* to design antibiotics in the first place for the sexually transmitted infection gonorrhoea and for potentially-deadly MRSA (methicillin-resistant Staphylococcus aureus).
…
> "We're excited because we show that generative AI can be used to design completely new antibiotics," Prof James Collins, from MIT, tells the BBC.
Generative AI is a lot of things. LLMs in particular (a subset of generative AI) are somewhat useful, but nowhere near as useful as what Sam claims. And I guess LLMs specifically, if we focus on ChatGPT, will not be solving cancer lol.
So we agree that Sam is selling snake oil. :)
Just wanted to point out that a lot of the fundamental “tech” is being used for genuinely useful things!
> One of those algorithms, known as chemically reasonable mutations (CReM), works by starting with a particular molecule containing F1 and then generating new molecules by adding, replacing, or deleting atoms and chemical groups. The second algorithm, F-VAE (fragment-based variational autoencoder), takes a chemical fragment and builds it into a complete molecule. It does so by learning patterns of how fragments are commonly modified, based on its pretraining on more than 1 million molecules from the ChEMBL database.
(The technical article about the MIT work is here: https://www.cell.com/cell/abstract/S0092-8674(25)00855-4)
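To make the quoted description concrete, here's a toy sketch of the "generate candidates by swapping fragments, then filter" idea using RDKit. The scaffold, fragment list, and property filter are invented for illustration; this is not the actual CReM or F-VAE code from the paper:

```python
# Toy fragment-swap candidate generation plus a crude property filter;
# purely illustrative, not the MIT group's implementation.
from rdkit import Chem
from rdkit.Chem import Descriptors

scaffold = "c1ccc(cc1){}"                    # benzene with one open substitution site
fragments = ["F", "Cl", "O", "N", "C(=O)O"]  # made-up fragment vocabulary

candidates = []
for frag in fragments:
    mol = Chem.MolFromSmiles(scaffold.format(frag))
    if mol is None:
        continue                             # skip chemically invalid combinations
    if Descriptors.MolWt(mol) < 500:         # crude drug-likeness cutoff
        candidates.append(Chem.MolToSmiles(mol))

print(candidates)
```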
Both the MIT programs and GPT-5 use "generative AI", but with entirely different training sets and perhaps very different training methods, architectures, etc. Indeed, the AI systems used in the MIT work were described in conference papers in 2018 and 2020 (citations in the Cell paper), so they predate the current generation of GPT models by quite a bit. In sum, the fact that the MIT model (reportedly) works well for developing antibiotics does not in any way imply that GPT-5 is a "scientific breakthrough", much less that LLMs will lead to AI that is able to "rewrite and improve itself, cure cancer, unify physics, or produce any other kind of scientific or technological breakthrough" (quoting the OP).
This is the natural progression of a mass-market business where cost savings are valued and quality is not. If you as a customer want a higher-quality product, you are pushed to the edges of the market: boutique, bespoke, upscale experiences that can only be offered at high quality because their scale is small and manageable on every metric, and whose survival against the Walmarts of their industry depends on being the higher-quality offering.
This piece's bias hurts its credibility. "It can’t even label a map" doesn't tell a story about the things these tools are useful for. And you know that something hundreds of millions of people are using has got to be pretty useful, or people wouldn't spend their money on it.
This would be a lot easier if they just published the weights for their models, even with a delay of a couple of years like Grok was supposed to. By keeping everything secret and using some of the worst naming conventions anyone has come up with, they leave everyone confused and frustrated. Combine that with the usual rug-pull feeling of hosted software and the anxiety people have about SOTA AI, and you have a perfect storm of upset users.
> “When bubbles happen, smart people get overexcited about a kernel of truth... Are we in a phase where investors as a whole are overexcited about AI? My opinion is yes.”
How is Sam Altman admitting an AI bubble not front page news everywhere?
but no con can endure forever
The wall is very obvious now though.
no one in the industry could have believed that
I am not in the industry but I've been following closely and I am usually skeptical, but while I erred on the side of "this is just a tool" I also wondered "what if?" more than once.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
So they're getting exponentially better at doing some easy fraction of programming work. But this would be like self-driving cars getting exponentially better at driving on very safe, easy roads, with absolutely no measurement of anything like chaotic streets, rural back roads, or edge cases like a semi swerving or weird reflections.