SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.
IDK what "people" are expecting but with the amount of hype I'd have to guess they were expecting more than we've gotten so far.
The fact that "fast takeoff" is a term I recognize indicates that some people believed OpenAI when they said this technology (transformers) would lead to sci-fi-style AI, and that is most certainly not happening.
Has he said anything about it since last September:
>It is possible that we will have superintelligence in a few thousand days (!); it may take longer, but I’m confident we’ll get there.
This is, at an absolute minimum, 2,000 days, which is about 5.5 years. And he says it may take longer.
Did he even say AGI next year any time before this? It looks like his predictions were all pointing at the late 2020s, and now he's thinking early 2030s. Which you could still make fun of, but it just doesn't match up with your characterization at all.
Not massively off -- yesterday Manifold's implied odds of results this low were ~35%, and 30% before Claude Opus 4.1 came out, which updated expected agentic coding abilities downward.
> As recently as June, the technical problems meant none of OpenAI’s models under development seemed good enough to be labeled GPT-5, according to a person who has worked on it.
But it could be that this refers to post-training and the base model was developed earlier.
https://www.theinformation.com/articles/inside-openais-rocky...
AI labs gather training data and then do a ton of work to process it, filter it etc.
Model training teams run different parameters and techniques against that processed training data.
It wouldn't surprise me to hear that OpenAI had collected data up to September 2024, dumped that data in a data warehouse of some sort, then spent months experimenting with ways to filter and process it and different training parameters to run against it.
> and minimizing sycophancy
Now we're talking about a good feature! Actually one of my biggest annoyances with Cursor (that mostly uses Sonnet).
"You're absolutely right!"
I mean not really Cursor, but ok. I'll be super excited if we can get rid of these sycophancy tokens.
The price should be compared to Opus, not Sonnet.
Is it actually simpler? For those currently using GPT-4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to at least 8, even before counting regular GPT-5: gpt-5-mini and gpt-5-nano, each at minimal, low, medium or high reasoning effort.
And, while choosing between all these options, we'll always have to wonder: should I try adjusting the prompt I'm using, or simply change the GPT-5 variant or its reasoning level?
Trying to get an accurate answer (best correlated with objective truth) on a topic I don't already know the answer to (or why would I ask?). This is, to me, the challenge with the "it depends, tune it" answers that always come up around how to use these tools -- the tuning requires you to already have a known-good answer, and if you have that, the tool wasn't useful to you in the first place.
If you have a task you do frequently, you need some kind of benchmark. That might just be comparing how well the output of the smaller models holds up against the output of the bigger model, if you don't know the ground truth.
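Roughly what I mean, as a sketch (this assumes the OpenAI Python SDK and the Responses API; the model names, prompts and the judge step are all just illustrative, not a recommendation):

    # Compare a smaller model's answers against a bigger model's answers on
    # prompts you actually care about, using the bigger model as the reference.
    from openai import OpenAI

    client = OpenAI()

    PROMPTS = [
        "Summarize the key risks in this contract clause: ...",
        "Extract all dates mentioned in this text: ...",
    ]

    def ask(model: str, prompt: str) -> str:
        resp = client.responses.create(model=model, input=prompt)
        return resp.output_text

    def agrees(reference: str, candidate: str) -> bool:
        # Use the bigger model as a judge of whether the smaller model's
        # answer matches the reference in substance.
        verdict = ask(
            "gpt-5",
            "Do these two answers agree in substance? Reply YES or NO.\n\n"
            f"Reference:\n{reference}\n\nCandidate:\n{candidate}",
        )
        return verdict.strip().upper().startswith("YES")

    held_up = sum(
        agrees(ask("gpt-5", p), ask("gpt-5-mini", p)) for p in PROMPTS
    )
    print(f"{held_up}/{len(PROMPTS)} answers held up against the bigger model")

It's not perfect (the judge can be wrong too), but it at least gives you a number to watch when you swap models or reasoning levels.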
Not really that much simpler.
But the specific nuance of picking nano/mini/main and minimal/low/medium/high comes down to experimentation and what your cost/latency constraints are.
With the API, you pick a model size and a reasoning effort. Yes, more choices, but also a clear mental model and a simple set of knobs that you control.
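Concretely, that's just two knobs on the request. A sketch with the OpenAI Python SDK (model names and the exact shape of the reasoning parameter are as I understand the current docs, so double-check before relying on it):

    from openai import OpenAI

    client = OpenAI()

    resp = client.responses.create(
        model="gpt-5-mini",           # or "gpt-5", "gpt-5-nano"
        reasoning={"effort": "low"},  # "minimal" | "low" | "medium" | "high"
        input="Classify this support ticket as billing, bug, or feature request: ...",
    )
    print(resp.output_text)

A reasonable default is to start with the smallest model at low effort and only move up when the output quality clearly doesn't hold.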
Would've been interesting to see a comparison between low, medium and high reasoning_effort pelicans :)
When I've played around with GPT-OSS-120b recently, it seems the difference in the final answer is huge: "low" is essentially "no reasoning", while "high" can spend a seemingly endless amount of tokens. I'm guessing the difference with GPT-5 will be similar?
Yeah, I'm working on that - expect dozens more pelicans in a later post.
Open source is years ahead of these guys on samplers. It's why their models being so good is that much more impressive.
They have confessed to doing a bad thing - training on copyrighted data without permission. Why does that indicate they would lie about a worse thing?
Because they know their audience. It's an audience that also doesn't care for copyright and would love for them to win their court cases. They are fine making such an argument to those kinds of people.
Meanwhile, when legal ran a very typical subpoena process on said data -- data they chose to submit to an online server of their own volition -- that same audience completely freaked out. Suddenly, they felt like their privacy was invaded.
It doesn't make any logical sense in my mind, but a lot of the discourse over this topic isn't based on logic.
https://finance.yahoo.com/news/enterprise-llm-spend-reaches-...
Their PRO models were not (IMHO) worth 10X that of PLUS!
Not even close.
Especially when new competitors (eg. z.ai) are offering very compelling competition.
right :-D
or even: https://github.com/sst/opencode
Not affiliated with either one of these, but they look promising.
It does sort of give me the vibe that the pure scaling maximalism really is dying off though. If the focus is now on writing better routers, tooling, and combining specialized submodels on tasks, then it feels like there's a search for new ways to improve performance (and lower cost), suggesting the other established approaches weren't working. I could totally be wrong, but I feel like if just throwing more compute at the problem were working, OpenAI probably wouldn't be spending much time optimizing user routing across existing strategies to get marginal improvements on average user interactions.
I've been pretty negative on the thesis of only needing more data/compute to achieve AGI with current techniques though, so perhaps I'm overly biased against it. If there's one thing that bothers me in general about the situation though, it's that it feels like we really have no clue what the actual status of these models is because of how closed off all the industry labs have become + the feeling of not being able to expect anything other than marketing language from the presentations. I suppose that's inevitable with the massive investments though. Maybe they've got some massive earthshattering model release coming out next, who knows.
So yeah, maybe we are getting more incremental improvements. But that to me seems like a good thing, because more good things earlier. I will take that over world-shattering any day – but if we were to consider everything that has happened since the first release of gpt-4, I would argue the total amount is actually very much world-shattering.
In the meantime, figuring out how to train them to make less of their most common mistakes is a worthwhile effort.
The interesting point to me, though, is that if it could create a startup that was worth $1B, that startup wouldn't actually be worth $1B.
Why would anyone pay that much to invest in the startup if they could recreate the entire thing with the same tool that everyone would have access to?
If your expectations were any higher than that, then it seems like you were caught up in hype. Doubling 2-3 times per year isn't leveling off by any means.
It is a benchmark but I'm not very convinced it's the be-all, end-all.
Who's suggesting it is?
Rather than my personal opinion, I was commenting on commonly viewed opinions of people I would believe to have been caught up in hype in the past. But I do feel that although that's a benchmark, it's not necessarily the end-all of benchmarks. I'll reserve my final opinions until I test personally, of course. I will say that increasing the context window probably translates pretty well to longer context task performance, but I'm not entirely convinced it directly translates to individual end-step improvement on every class of task.
The common concept for AGI seems to be much more about human replacement - the ability to complete "economically valuable tasks" better than humans can. I still don't understand what our human lives or economies would look like there.
What I personally wanted from GPT-5 is exactly what I got: models that do the same stuff that existing models do, but more reliably and "better".
That's pretty much the key component these approaches have been lacking on, the reliability and consistency on the tasks they already work well on to some extent.
I think there's a lot of visions of what our human lives would look like in that world that I can imagine, but your comment did make me think of one particularly interesting tautological scenario in that commonly defined version of AGI.
If artificial general intelligence is defined as completing "economically valuable tasks" better than humans can, it requires one to define "economically valuable." As it currently stands, something holds value in an economy relative to human beings wanting it. Houses get expensive because many people, each of whom have economic utility which they use to purchase things, want to have houses, of which there is a limited supply for a variety of reasons. If human beings are not the most effective producers of value in the system, they lose the capability to trade for things, which negates that existing definition of economic value. It doesn't matter how many people would pay $5 for your widget if people have no economic utility relative to AGI, meaning they cannot trade that utility for goods.
In general that sort of definition of AGI being held reveals a bit of a deeper belief, which is that there is some version of economic value detached from the humans consuming it. Some sort of nebulous concept of progress, rather than the acknowledgement that for all of human history, progress and value have both been relative to the people themselves getting some form of value or progress. I suppose it generally points to the idea of an economy without consumers, which is always a pretty bizarre thing to consider, but in that case, wouldn't it just be a definition saying that "AGI is achieved when it can do things that the people who control the AI system think are useful." Since in that case, the economy would eventually largely consist of the people controlling the most economically valuable agents.
I suppose that's the whole point of the various alignment studies, but I do find it kind of interesting to think about the fact that even the concept of something being "economically valuable", which sounds very rigorous and measurable to many people, is so nebulous as to be dependent on our preferences and wants as a society.
You've posted substantive comments in other threads, so this should be easy to fix.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
Nothing in the current technology offers a path to AGI. These models are fixed after training completes.
> the model can remember stuff as long as it’s in the context.
You would need an infinite context or compression. Also, you might be interested in this theorem.
Only if AGI would require infinite knowledge, which it doesn’t.
LLMs might look "creative" but they are just remixing patterns from their training data and what is in the prompt. They can't actually update themselves or remember new things after training, as there is no ongoing feedback loop.
This is why you can’t send an LLM to medical school and expect it to truly “graduate”. It cannot acquire or integrate new knowledge from real-world experience the way a human can.
Without a learning feedback loop, these models are unable to interact meaningfully with a changing reality or fulfill the expectation from an AGI: Contribute to new science and technology.
Basically, I wouldn’t say that an LLM can never become AGI due to its architecture. I also am not saying that LLM will become AGI (I have no clue), but I don’t think the architecture itself makes it impossible.
So yeah, AGI is impossible with today's LLMs. But at least we got to watch Sam Altman and Mira Murati drop their voices an octave onstage and announce "a new dawn of intelligence" every quarter. Remember Sam Altman's $7 trillion?
Now that the AGI party is over, it's time to sell those NVDA shares and prepare for the crash. What a ride it was. I am grabbing the popcorn.
You could maybe accomplish this if you could fit all new information into context or with cycles of compression but that is kinda a crazy ask. There's too much new information, even considering compression. It certainly wouldn't allow for exponential growth (I'd expect sub linear).
I think a lot of people greatly underestimate how much new information is created every day. It's hard to see if you're not working on any research and watching how incremental but constant improvement compounds. But try just looking at whatever company you work for. Do you know everything that people did that day? It takes more time to generate information than to process it, so that's on your side, but do you really think you could keep up? Maybe at a very high level, but in that case you're missing a lot of information.
Think about it this way: if that could be done then LLM wouldn't need training or tuning because you could do everything through prompting.
I’m not saying this is a realistic or efficient method to create AGI, but I think the argument „Model is static once trained -> model can’t be AGI“ is fallacious.
If you want actual big moves, watch Google, Anthropic, Qwen, DeepSeek.
Qwen and Deepseek teams honestly seem so much better at under promising and over delivering.
Can't wait to see what Gemini 3 looks like too.
At this point it's pretty much a given that it's a game of inches moving forward.
According to the article, GPT-5 is actually three models, and they can each be run at 4 levels of thinking. That's a dozen ways you can run any given input on "GPT-5", so it's hardly a simple product lineup (but maybe better than before).
Are you trying to say the curve is flattening? That advances are coming slower and slower?
As long as it doesn't suggest a dot com level recession I'm good.
But I do think the fact that we can publicly observe this reallocation of resources and emphasized aspects of the models gives us a bit of insight into what could be happening behind the scenes if we think about the reasons why those shifts could have happened, I guess.
> It does sort of give me the vibe that the pure scaling maximalism really is dying off though
I think the big question is if/when investors will start giving money to those who have been predicting this (with evidence) and trying other avenues. Really though, why put all your eggs in one basket? That's what I've been confused about for a while. Why fund yet another LLMs-to-AGI startup? The space is saturated with big players and has been for years. Even if LLMs could get there, that doesn't mean something else won't get there faster and for less. It also seems you'd want a backup in order to avoid popping the bubble. Technology S-curves and all that still apply to AI.
I'm similarly biased, of course, but so is everyone I know with a strong math and/or science background (I even mentioned it in my thesis more than a few times lol). "Scaling is all you need" just doesn't check out.
I think a somewhat comparable situation is in various online game platforms now that I think about it. Investors would love to make a game like Fortnite, and get the profits that Fortnite makes. So a ton of companies try to make Fortnite. Almost all fail, and make no return whatsoever, just lose a ton of money and toss the game in the bin, shut down the servers.
On the other hand, it may have been more logical for many of them to go for a less ambitious (not always online, not a game that requires a high player count and social buy-in to stay relevant) but still profitable investment (Maybe a smaller scale single player game that doesn't offer recurring revenue), yet we still see a very crowded space for trying to emulate the same business model as something like Fortnite. Another more historical example was the constant question of whether a given MMO would be the next "WoW-killer" all through the 2000's/2010's.
I think part of why this arises is that there's definitely a bit of a psychological hack for humans in particular where if there's a low-probability but extremely high reward outcome, we're deeply entranced by it, and investors are the same. Even if the chances are smaller in their minds than they were before, if they can just follow the same path that seems to be working to some extent and then get lucky, they're completely set. They're not really thinking about any broader bubble that could exist, that's on the level of the society, they're thinking about the individual, who could be very very rich, famous, and powerful if their investment works. And in the mind of someone debating what path to go down, I imagine a more nebulous answer of "we probably need to come up with some fundamentally different tools for learning and research a lot of different approaches to do so" is a bit less satisfying and exciting than a pitch that says "If you just give me enough money, the curve will eventually hit the point where you get to be king of the universe and we go colonize the solar system and carve your face into the moon."
I also have to acknowledge the possibility that they just have access to different information than I do! They might be getting shown much better demos than I do, I suppose.
Compared to the GPT-4 release which was a little over 2 years ago (less than the gap between 3 and 4), it is. The only difference is we now have multiple organizations releasing state of the art models every few months. Even if models are improving at the same rate, those same big jumps after every handful of months was never realistic.
It's an incremental stable improvement over o3, which was released what? 4 months ago.
"reasoning": {"summary": "auto"} }'
Here’s the response from that API call.
https://gist.github.com/simonw/1d1013ba059af76461153722005a0...
Without that option the API will often provide a lengthy delay while the model burns through thinking tokens until you start getting back visible tokens for the final response.
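For anyone curious, here's roughly what the same thing looks like from the Python SDK with streaming, which is how you avoid staring at that delay (a sketch; the exact streaming event types may differ from what I've assumed here, so treat the loop as illustrative):

    # Request reasoning summaries ("summary": "auto") and stream the response,
    # so something shows up while the model is still thinking.
    from openai import OpenAI

    client = OpenAI()

    stream = client.responses.create(
        model="gpt-5",
        input="Generate an SVG of a pelican riding a bicycle",
        reasoning={"effort": "high", "summary": "auto"},
        stream=True,
    )

    for event in stream:
        # Reasoning-summary deltas arrive before the final answer tokens,
        # so the wait no longer looks like dead air.
        if event.type.endswith(".delta") and getattr(event, "delta", None):
            print(event.delta, end="", flush=True)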
This is sort of interesting to me. It strikes me that so far we've had more or less direct access to the underlying model (apart from the system prompt and guardrails), but I wonder if going forward there's going to be more and more infrastructure between us and the model.
This has me so confused, Claude 4 (Sonnet and Opus) hallucinates daily for me, on both simple and hard things. And this is for small isolated questions at that.
>Looking through the document, I can identify several instances where it's written in the first person:
And it went on to show a series of "they/them" statements. I asked it to clarify if "they" is "first person" and it responded
>No, "they" is not first person - it's third person. I made an error in my analysis. First person would be: I, we, me, us, our, my. Second person would be: you, your. Third person would be: he, she, it, they, them, their. Looking back at the document more carefully, it appears to be written entirely in third person.
Even the good models are still failing at real-world use cases which should be right in their wheelhouse.
Could you give an estimate of how many "dumb errors" you've encountered, as opposed to hallucinations? I think many of your readers might read "hallucination" and assume you mean "hallucinations and dumb errors".
As a user, when the model tells me things that are flat out wrong, it doesn't really matter whether it would be categorized as a hallucination or a dumb error. From my perspective, those mean the same thing.
It's hard to know why it made the error but isn't it caused by inaccurate "world" modeling? ("World" being English language) Is it not making some hallucination about the English language while interpreting the prompt or document?
I'm having a hard time trying to think of a context where "they" would even be first person. I can't find any search results though Google's AI says it can. It provided two links, the first being a Quora result saying people don't do this but framed it as it's not impossible, just unheard of. Second result just talks about singular you. Both of these I'd consider hallucinations too as the answer isn't supported by the links.
Often the hallucinations I see are subtle, though usually critical. I see it when generating code, doing my testing, or even just writing. There are hallucinations in today's announcements, such as the airfoil example[0]. An example of more obvious hallucinations is I was asking for help improving writing an abstract for a paper. I gave it my draft and it inserted new numbers and metrics that weren't there. I tried again providing my whole paper. I tried again making explicit to not add new numbers. I tried the whole process again in new sessions and in private sessions. Claude did better than GPT 4 and o3 but none would do it without follow-ups and a few iterations.
Honestly I'm curious what you use them for where you don't see hallucinations
[0] which is a subtle but famous misconception. One that you'll even see in textbooks. Hallucination probably caused by Bernoulli being in the prompt
For factual information I only ever use search-enabled models like o3 or GPT-4.
Most of my other use cases involve pasting large volumes of text into the model and having it extract information or manipulate that text in some way.
If the question is about harder facts which the human disagrees with, this may put it into an essentially self-contradictory state, where the locus of possibilities gets squished from each direction, and so the model is forced to respond with crazy outliers which agree with both the human and the data. The probability of an invented reference being true may be very low, but from the model's perspective, it may still be one of the highest-probability outputs among a set of bad choices.
What it sounds like they may have done is just have the humans tell it it's wrong when it isn't, and then award it credit for sticking to its guns.
Yeah, it seems to be a terrible approach to try to "correct" the context by adding clarifications or telling it what's wrong.
Instead, start from 0 with the same initial prompt you used, but improve it so the LLM gets it right in the first response. If it still gets it wrong, begin from 0 again. The context seems to be "poisoned" really quickly, if you're looking for accuracy in the responses. So better to begin from the beginning as soon as it veers off course.
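As a sketch of what that loop looks like in practice (OpenAI Python SDK assumed; the prompts and the check() function are just stand-ins for whatever task and validation you actually have):

    from openai import OpenAI

    client = OpenAI()

    # Progressively tightened prompts -- each attempt starts a fresh context
    # instead of appending corrections to the old one.
    attempts = [
        "Rewrite this abstract to be more concise: {draft}",
        "Rewrite this abstract to be more concise. Do not add any numbers or "
        "metrics that are not already present: {draft}",
        "Rewrite this abstract to be more concise. Use ONLY facts and numbers "
        "that appear verbatim in the draft below: {draft}",
    ]

    def check(draft: str, output: str) -> bool:
        # Stand-in check: every digit in the output must already appear in the draft.
        return all(ch in draft for ch in output if ch.isdigit())

    draft = "..."  # your actual text
    for prompt in attempts:
        # Fresh call every time -- no conversation history carried over.
        resp = client.responses.create(model="gpt-5", input=prompt.format(draft=draft))
        if check(draft, resp.output_text):
            print(resp.output_text)
            break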
So not seeing them means they're either lying or incompetent. I always try to attribute to stupidity rather than malice (Hanlon's razor).
The big problem of LLMs is that they optimize human preference. This means they optimize for hidden errors.
Personally I'm really cautious about using tools that have stealthy failure modes. They just lead to many problems and lots of wasted hours debugging, even when failure rates are low. It just causes everything to slow down for me as I'm double checking everything and need to be much more meticulous if I know it's hard to see. It's like having a line of Python indented with an inconsistent white space character. Impossible to see. But what if you didn't have the interpreter telling you which line you failed on or being able to search or highlight these different characters. At least in this case you'd know there's an error. It's hard enough dealing with human generated invisible errors, but this just seems to perpetuate the LGTM crowd
OPENAI_DEFAULT_MODEL=gpt-5 codex
They used to be more about the training process itself, but that's increasingly secretive these days.