I'd imagine this must be a big leg up on Anthropic to warrant the "GPT-5" name?
https://epoch.ai/gradient-updates/how-much-energy-does-chatg...
Edit: Scrolling down: "one second of H100-time per query, 1500 watts per H100, and a 70% factor for power utilization gets us 1050 watt-seconds of energy", which is how they get down to 0.3 = 1050/60/60.
OK, so if they run it for a full hour it's 1050*60*60 = 3.8 MW? That can't be right.
Edit Edit: Wait, no, it's just 1050 watt-hours, right (though let's be honest, the 70% power utilization is a bit goofy - the power is still used)? So it's 3x the power to solve the same question?
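For anyone else who got tangled in the units, here's the arithmetic spelled out (same numbers as the article, nothing new):

    E_query = 1\,\text{s} \times 1500\,\text{W} \times 0.70 = 1050\,\text{J} = \tfrac{1050}{3600}\,\text{Wh} \approx 0.3\,\text{Wh}
    E_hour  = 1050\,\text{W} \times 3600\,\text{s} \approx 3.8\,\text{MJ} = 1050\,\text{Wh} \approx 1.05\,\text{kWh}

Multiplying by 3600 gives energy (joules), not power, which is where the "MW" confusion comes from; one H100 at that de-rated draw for a full hour is roughly 1 kWh.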
It's the same as 4G vs 5G. They have a technical definition, but it's all about marketing.
The best part is, this is not even the real definition of "AGI" yet (whatever that means at this point).
More like 10% of the capability that was promised, and already the flow of capital from the inflated salaries of the past decade is going to the top AI researchers.
Having both eliminates a feedback loop and the LLM enables you to get shit done fast.
Official OpenAI gpt-5 coding examples repo: https://github.com/openai/gpt-5-coding-examples (https://news.ycombinator.com/item?id=44826439)
Github leak: https://news.ycombinator.com/item?id=44826439
Will be interesting to see what pushing it harder does – what the new ceiling is. 88% on aider polyglot is pretty good!
A more useful demonstration, like making large, meaningful changes to a big, complicated codebase, would be much harder to evaluate, since you need to be familiar with the existing system to judge the quality of the transformation.
Would be kinda cool to instead see diffs of nontrivial patches to the Ruby on Rails codebase or something.
This seems to impress the mgmt types a lot, e.g. "I made a WHOLE APP!", when most of this is really just frameworks and tech that had crappy bootstrapping to begin with (React and JS are rife with this, in spite of their popularity).
I recently used OpenAI models to generate OCaml code, and it was eye opening how much even reasoning models are still just copy and paste machines. The code was full of syntax errors, and they clearly lacked a basic understanding of what functions are in the stdlib vs those from popular (in OCaml terms) libraries.
Maybe GPT-5 is the great leap and I'll have to eat my words, but this experience really made me more pessimistic about AI's potential and the future of programming in general. I'm hoping that in 10 years niche languages are still a thing, and the world doesn't converge toward writing everything in JS just because AIs make it easier to work with.
Isn't that the rub though? It's not an ex nihilo "intelligence", it's whatever stuff it's trained on and can derive completions from.
Maybe I spend too much time rage baiting myself reading X threads and that's why I feel the need to emphasize that AI isn't what they make it out to be.
You don't need more than JS for that.
Agreed. The models break down even on code that isn't that complex, if it's not web/JavaScript. I was playing with Gemini CLI the other day and had it try to make a simple Avalonia GUI app in C#/.NET; it kept going around in circles and couldn't even get a basic starter project to build, so I can imagine how much it'd struggle with OCaml or other more "obscure" languages.
This makes the tech even less useful where it'd be most helpful - on internal, legacy codebases, enterprisey stuff, stacks that don't have numerous examples on github to train from.
Or anything that breaks the norm really.
I recently wrote something where I updated a variable using atomic primitives. Because it was inside a hot path I read the value without using atomics as it was okay for the value to be stale. I handed it the code because I had a question about something unrelated and it wouldn't stop changing this piece of code to use atomic reads. Even when I prompted it not to change the code or explained why this was fine it wouldn't stop.
While what you were doing may have been fine given your context, if you're targeting e.g. standard C++, you really shouldn't be doing it (it's UB). You can usually get the same result with relaxed atomic load/store.
(As far as AI is concerned, I do agree that the model should just have followed your direction though.)
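For anyone following along, here's a minimal sketch of the relaxed-atomic version being suggested (standard C++; the names and the counter type are made up, not the parent's actual code). On x86 and ARM a relaxed load typically compiles to a plain load, so the hot path stays just as cheap while avoiding the data-race UB:

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> counter{0};

    // Writer path: update the value atomically from whatever thread produces it.
    void bump() {
        counter.fetch_add(1, std::memory_order_relaxed);
    }

    // Hot-path reader: a stale value is acceptable, so a relaxed load is enough.
    // Unlike a plain read of a concurrently written non-atomic variable,
    // this is well-defined behavior.
    std::uint64_t read_hint() {
        return counter.load(std::memory_order_relaxed);
    }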
"This repository contains a curated collection of demo applications generated entirely in a single GPT-5 prompt, without writing any code by hand."
https://github.com/openai/gpt-5-coding-examples
This is promising!
yikes - the poor executive leadership’s fragile egos cannot take the criticism.
In practice, it's very clear to me that the most important value in writing software with an LLM isn't its ability to one-shot hard problems, but rather its ability to effectively manage complex context. There are no good evals for this kind of problem, but that's what I'm keenly interested in understanding. Show me GPT-5 can move through 10 steps in a list of tasks without completely losing the objective by the end.
It would be trivial to over-fit, if that was their goal.
But why would there be a large number of good SVG images of pelicans on bikes? Especially relative to all the things we actually want them to generalise over?
Surely most of the SVG images of pelicans on bikes are, right now, going to be "look at this rubbish AI output"? (Which may or may not be followed by a comment linking to that artist who got humans to draw bikes and oh boy were those humans wildly bad at drawing bikes, so an AI learning to draw SVGs from those bitmap pictures would likely also still suck…)
edit: YouTube has a few English "watch party" streams, although there too, the Spanish ones have many times more viewers.
Especially Google I/O - each year is different; it seems purpose-built?
Livestream link: https://www.youtube.com/live/0Uu_VJeVVfo
Research blog post: https://openai.com/index/introducing-gpt-5/
Developer blog post: https://openai.com/index/introducing-gpt-5-for-developers
API Docs: https://platform.openai.com/docs/guides/latest-model
Note the free form function calling documentation: https://platform.openai.com/docs/guides/function-calling#con...
GPT-5 prompting guide: https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_g...
GPT-5 new params and tools: https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_...
GPT-5 frontend cookbook: https://cookbook.openai.com/examples/gpt-5/gpt-5_frontend
Prompt migrator/optimizer: https://platform.openai.com/chat/edit?optimize=true
Enterprise blog post: https://openai.com/index/gpt-5-new-era-of-work
System Card: https://openai.com/index/gpt-5-system-card/
What would you say if you could talk to a future OpenAI model? https://progress.openai.com/
coding examples: https://github.com/openai/gpt-5-coding-examples
basically in my testing it really felt that gpt5 was "using tools to think" rather than just "using tools". it gets very powerful when coding long-horizon tasks (a separate post i'm publishing later).
to give one substantive example, in my developer beta (they will release the video in a bit) i put it to a task that claude code had been stuck on for the last week - same prompts - and it just added logging to instrument some of the failures that we were seeing and - from the logs that it added and asked me to rerun - figured out the solve.
> It’s actually worse at writing than GPT-4.5
Sounds like we need to wait a bit for the dust to settle before one can trust anything one hears/reads :)
It's hard to make a man understand something when it's standing between him and his salary.
I found it strange that, despite my excitement for such an event being roughly equivalent to WWDC these days, I had 0 desire to watch the live stream for exactly this reason: it’s not like they’re going to give anything to us straight.
Even this year's WWDC I at least skipped through the video afterwards. Before, I used to have watch parties. Yes, they're overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is I get from these (applicable to OpenAI, Grok, Meta, etc.)
It's been just a few years of a revolutionary technology and already the livestreams are less appealing than the biggest corporations' yearly events. Personally I find that sad.
“It’s actually worse at writing than GPT-4.5, and I think even 4o”
So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.
I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.
Yeah, I also do my own benchmarks, but as you said, coding ones are a bit harder. Currently I'm mostly benchmarking the accuracy of the tools I've written, which do bite-sized work. One tool is for editing parts of files, another to rewrite files fully, and so on, and each is individually benchmarked. But they're very specific; I do no overall benchmark for "Change feature X to do Y" which would span a "session" - I haven't found any good way of evaluating the results, just like you :)
I have yet to test this out end to end
1) Internal Retrieval
2) Web Search
3) Code Interpreter
4) Actions
How did you come up with this idea?
Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.
Then I noticed the date on the comment: 2023.
Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. It’s technically correct, since we’re not moving backward. It’s just not a very meaningful statement.
By that standard Neolithic tool use was progress to AGI.
In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”
>"While I never use AI for personal writing (because I have a strong belief in writing to think)"
The optimal AI productivity process is starting to look like:
AI Generates > Human Validates > Loop
Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.
Similar to how physical activity is how muscles/bone density/etc grow, and how body tissues are maintained.
Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.
AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.
At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).
Soon we will need to be intentional about building/maintaining cognitive strength.
I suspect the workday/week of the future will be split on AI-on-a-leash work for optimal productivity, with carve-outs for dedicated AI-enhanced-learning solely for building/maintaining cognitive health (where productivity is not the goal, building/maintaining cognition is). Similar to how we carve out time for working out.
What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?
Is there a name for this theory?
If not can you coin one? You're great at that :)
problem with your theory is it bundles 2-3 steps which each could be their own theses
suggest you nail those down before building up to a general bundle (or mental model/framework)
1) Regarding the "generation is how learning occurs" claim, I'm going off of this:
https://www.learningscientists.org/blog/2024/3/7/how-does-re...
Granted, that article refers to retrieval specifically being one major way we learn, and of course learning incorporates many dimensions. But it seems a bit self-evident that retrieval occurs heavily during active problem solving (ie "generation"), and less so during passive learning (ie: just reading/consuming info).
From personal experience, I always noticed I learned much more by doing than by consuming documentation alone.
But yes, I admit this assumption and my own personal experience/bias is doing a lot of heavy lifting for me...
2) Regarding the "optimal AI productivity process" (AI Generates > Human Validates > Loop)
I'm using Karpathy's productivity loop described in his AI startup school talk last month here:
https://youtu.be/LCEmiRjPEtQ?t=1327
Does this help make it more concrete Swyx (name dropping you here since I'm pretty sure you've got a social listener set for your handle ;)? Love to hear your thoughts straight from the hip based on your own personal experiences.
Full disclosure: I'm not trying to get too academic about this. In all honesty, I'm really trying to get to an informal theory that's useful and practical enough that it can be turned into a regular business process for rapid professional development.
The parallel with “intentionally working out to maintain physical strength” is extremely helpful as an analogy to communicate this concept.
That’s exactly what we might be faced with… cognitive atrophy…
It’s arguably already started, and is accelerating!
The academic benchmark score improves by only 5%, but they make the bar 50% higher.
Like what? Deepseek?
How is it uninteresting? OpenAI had revenue of $12B last year without monetizing literally hundreds of millions of free users in any way whatsoever (not even ads).
Microsoft's cloud revenue has exploded in the last few years off the back of AI model services. Let's not even get into the other players.
$100B in economic impact is more than achievable with the technology we have today right now. That half is the interesting part.
And it could have been $1T for all anyone cares. The impact was delivered by humans. This is about impact delivered by AGI.
If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
But not at the "hand" of AGI. Perhaps you forgot to read your very own definition? Notably the "autonomous" part.
When AGI is set free and starts up "Closed I", generating $12B in economic value without humans steering the wheel, we will be (well, I will be, at least!) thoroughly impressed. But Microsoft won't be. They won't consider it AGI until it does $100B.
> If you use GPT-N substantially in your work, then saying that impact rests solely on you is nonsensical.
And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
“A highly autonomous system that outperforms humans at most economically valuable work.” is what's in their charter.
$100B in profits is a separate agreement with Microsoft that makes no mention of autonomy.
>And if you use a hammer substantially in your work to generate $100B in value, a hammer is AGI according to you? You can hold that idea, but that's not what anyone else is talking about. The primary indicator of AGI, as you even said yourself, is autonomy.
The primary indicator of AGI is whatever you want it to be. The words themselves make no promise of autonomy, simply an intelligence general in nature. We are simply discussing OpenAI's definitions.
Again, autonomy is implied when talking about AGI. OpenAI selling tools like GPT or dishwashers, even if they were to provide the $100B in economic impact, would not satisfy the agreement. It is specifically about AGI, and there should be no confusion about what AGI is here as you helpfully defined it for us.
And PhDs are not very smart imho (I am one)
1. I desperately want (especially from Google)
2. Is impossible, because it will be super gamed, to the detriment of actually building flexible flows.
Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
Exactly. Too many videos - too little real data / benchmarks on the page. Will wait for vibe check from simonw and others
https://openai.com/gpt-5/?video=1108156668
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264
GPT-5 pricing: $10/Mtok out
What am I missing?
I'm not sure when they slashed the o3 pricing, but the GPT-5 pricing looks like they set it to be identical to Gemini 2.5 Pro.
If you scroll down on this page you can see what different models cost when 2.5 Pro was released: https://deepmind.google/models/gemini/pro/
See comparison between GPT-5, 4.1, and o3 tool calling here: https://promptslice.com/share/b-2ap_rfjeJgIQsG.
> you should get another wheated bourbon like Maker's Mark French oaked
I agree. I've found Maker's Mark products to be a great bang for your buck, quality-wise and flavor-wise as well.
> I think the bourbon "market" kind of popped recently
It def did. The overproduction that was invested in during the peak of the COVID collector boom is coming into markets now. I think we'll see some well-priced, age-stated products in the next 3-4 years, based on what I hear from my acquaintances in the space.
Ofc, the elephant in the room is consolidation - everyone wants to copy the LVMH model (and they say Europeans are ethical elves who never use underhanded monopolistic and market-making behavior to corner markets /s).
(Not to undermine progress in the foundational model space, but there is a lack of appreciation for the democratization of domain specific models amongst HNers).
The room is the limiting factor in most speaker setups. The worse the room, the sooner you hit diminishing returns for upgrading any other part of the system.
In a fantastic room a $50 speaker will be nowhere near 95% of the performance of a mastering monitor, no matter how much EQ you put on it. In the average living room with less than ideal speaker and listening position placement there will still be a difference, but it will be much less apparent due to the limitations of the listening environment.
You might lose headroom or have to live with higher latency but if your complaint is about actual empirical data like frequency response or phase, that can be corrected digitally.
DSP is a very powerful tool that can make terrible speakers and headphones sound great, but it's not magic.
Pretty much my first point… At the same time, that same DSP can make a pretty mediocre speaker that can reproduce those frequencies do so in phase at the listening position, so once again the point is moot - effectively, add a cheap sub.
There is no time where you cannot get results from mediocre transducers given the right processing.
I’m not arguing you should, but in 2025 if a speaker sounds bad it is entirely because processing was skimped on.
I now wonder if I have any such hobbies. Probably not to the same extent as audiophiles, but some software-related stuff could come close.
Pretty par-for-the-course evals-at-launch setup.
How is this sustainable?
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
I'd really just love incremental improvements over sonnet. Increasing the context window on sonnet would be a game changer for me. After auto-compact the quality may fall off a cliff and I need to spend some time bringing it back up to speed.
When I need a bit more punch for more reasoning / architecture type evaluations, I have it talk to gemini pro via zen mcp and OpenRouter. I've been considering setting up a subagent for architecture / system design decisions that would use the latest opus to see if it's better than gemini pro (so far I have no complaints though).
People knew that GPT-5 wouldn't be an AGI or even close to that. It's just an updated version. GPT-N would become more or less like an annual release.
https://chatgpt.com/share/6895d5da-8884-8003-bf9d-1e191b11d3...
- they are only evals
- this is mostly positioned as a general consumer product, they might have better stuff for us nerds in hand.
If you email us at hn@ycombinator.com and tell us who you want to contact, we might be able to email them and ask if they would be willing to have you contact them. No guarantees though!
It's a perfect situation for Nvidia. You can see that after months of trying to squeeze out all % of marginal improvements, sama and co decided to brand this GPT-4.0.0.1 version as GPT-5. This is all happening on NVDA hardware, and they are gonna continue desperately iterating on tiny model efficiencies until all these valuation $$$ sweet sweet VC cash run out (most of it directly or indirectly going to NVDA).
There was this joke in this thread that there are ChatGPT sommeliers discussing the subtle differences between the different models nowadays.
It's funny cause in the last year the models have kind of converged in almost every aspect, but the fanbase, kind of like pretentious sommeliers, is trying to convince us that the subtle 0.05% difference on some obscure benchmark is really significant and that they, the experts, can really feel the difference.
It's hilarious and sad at the same time.
To tell a made-up anecdote: A colleague told me how his professor friend was running statistical models over night because the code was extremely unoptimized and needed 6+ hours to compute. He helped streamline the code and took it down to 30 minutes, which meant the professor could run it before breakfast instead.
We are completely fine with giving a task to a Junior Dev for a couple of days and see what happens. Now we love the quick feedback of running Claude Max for a hundred bucks, but if we could run it for a buck over night? Would be quite fine for me as well.
If you buy enough GPUs to do 1000 customers' requests in a minute, you could run 60 requests for each of these customers in an hour, or you could run a single request each for 60,000 customers in that same hour. The latter can be much cheaper per customer if people are willing to wait. (In reality it's a big N x M scheduling problem, and there are tons of ways to offer tiered pricing where cost and time are the main tradeoffs.)
The current situation is kind of like a grand prize where Zuck or similar will hand $1bn to anyone who cracks it. That's a huge incentive for people to have a go.
It's too high in that it requires actual consciousness, which may be a very tough architectural problem at best (if functionalism is true) or an unknowable metaphysical mystery at worst (if some form of substance or property dualism is true).
And it's much too low a standard in that many, many sentient creatures are nowhere near intelligent enough to be useful assistants in the domains where we want to use AI.
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
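To put rough numbers on that framing:

    \text{allowed error rate at } n \text{ nines} = 10^{-n}: \quad 99\% \Rightarrow 1 \text{ in } 100, \qquad 99.9\% \Rightarrow 1 \text{ in } 1000

Each additional nine cuts the permitted error rate by another factor of 10, so the required improvement compounds with every nine you add.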
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it's theoretically possible it's not even physically possible?
I think the actual effect of releasing more models every month has been to confuse people into thinking that progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted any more than the original ChatGPT could, despite years of effort.
to the point on hallucination - that's just the nature of LLMs (and humans to some extent). without new architectures or fact checking world models in place i don't think that problem will be solved anytime soon. but it seems gpt-5 main selling point is they somehow reduced the hallucination rate by a lot + search helps with grounding.
here's something else to think about. try and tell everybody to go back to using gpt-4. then try and tell people to go back to using o1-full. you likely won't find any takers. it's almost like the newer models are improved and generally more useful
I'm not saying they're not delivering better incremental results for people for specific tasks, I'm saying they're not improving as a technology in the way big tech is selling.
The technology itself is not really improving because all of the showstopping downsides from day one are still there: Hallucinations. Limited context window. Expensive to operate and train. Inability to recall simple information, inability to stay on task, support its output, or do long term planning. They don't self-improve or learn from their mistakes. They are credulous to a fault. There's been little progress on putting guardrails on them.
Little progress especially on the ethical questions that surround them, which seem to have gone out the window with all the dollar signs floating around. They've put waaaay more effort into the commoditization front. 0 concern for the impact of releasing these products to the world, 100% concern about how to make the most money off of them. These LLMs are becoming more than the model, they're now a full "service" with all the bullshit that entails like subscriptions, plans, limits, throttling, etc. The enshittification is firmly afoot.
however, a lot of your claims are false - progress is being made in nearly all the areas you mentioned
> hallucinations
are reduced with GPT-5
https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb...
"gpt-5-thinking has a hallucination rate 65% smaller than OpenAI o3"
> limited context window
same deal. gemini 2.5-pro has a 1 million token context window and GPT-5 is 400k up from 200k with o3
https://blog.google/technology/google-deepmind/gemini-model-...
"native multimodality and a long context window. 2.5 Pro ships today with a 1 million token context window (2 million coming soon)"
> expensive to operate and train
we don't know for certain but GPT-5 provides the most intelligence for the cheapest price at $10/1 million output tokens which is unprecedented
https://platform.openai.com/docs/models/gpt-5
> guardrails
are very well implemented in certain models like google who provide multiple safety levels
https://ai.google.dev/gemini-api/docs/safety-settings
"You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted."
now i'd like to ask you for evidence that none of these aspects have been improved - since you claim my examples are vague but make statements like
> Inability to recall simple information
> inability to stay on task
> (doesn't) support its output
> (no) long term planning
i've experienced the exact opposite. not 100% of the time, but compared to GPT-4 all of these areas have been massively improved. sorry i can't provide every single chat log i've ever had with these models to satisfy your vagueness-o-meter, or provide benchmarks which i assume you will brush aside.
as well as the examples i've provided above - you seem to be making claims out of thin air and then claim others are not providing examples up to your standard.
You're arguing against a strawman. I'm not saying there haven't been incremental improvements for the benchmarks they're targeting. I've said that several times now. I'm sure you're seeing improvements in the tasks you're doing.
But for me to say that there is more than a shell game going on, I will have to see tools that do not hallucinate. A (claimed, who knows if that's right, they can't even get the physics questions or the charts right) reduction of 65% is helpful but doesn't make these things useful tools in the way they're claiming they are.
> sorry i cant provide every single chat log ive ever had with these models to satisfy your vagueness-o-meter
I'm not asking for all of them, you didn't even share one!
Anyway, I just had this chat with the brand new state of the art Chat GPT 5: https://chatgpt.com/share/68956bf0-4d74-8001-88fe-67d5160436...
Like I said, despite all the advances in the breathless press releases you're touting, the brand-new model is just one bad roll away from acting like the models from 3 years ago, and until that isn't the case, I'll continue to believe that the technology has hit a wall.
If it can't do this after how many years, then how is it supposed to be the smartest person I know in my pocket? How am I supposed to trust it, and build a foundation on it?
Compilers were not and are not always perfect, but i think ai has a long way to go before it passes that threshold. People act like it will in the next few years, but the current trajectory strongly suggests that is not the case.
Not saying things are not getting better, but i have found that the claims of amazing results come from people who are not expert enough in the given domain to judge the actual quality of the output.
I love vibing out rust and it compiles and runs but i have no idea if it is good rust because well, i barely understand rust.
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build adapting weights into LLMs we can talk about different things, but test-time scaling is coming to an end as hallucination rates increase. No sign of AGI.
I don't believe your assessment though. IMO is hard, and Google have said that they use search and some way of combining different reasoning traces, so while I haven't read that paper yet, and of course it may support your view, I just don't believe it.
We are not close to solving IMO with publicly known methods.
> We are not close to solving IMO with publicly known methods.
The point here is not the method but rather computation power. You can solve any verifiable task with enough computation; sure, there must be tweaks to the methods, but I don't think it is something very big and different. OAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see within 2 years at most; all the big tech companies are focusing on that now, I think.
Non-output tokens were basically introduced by QuietSTaR, which is rather new. What method from five years ago does anything like that?
Of course, people regarded things like GSM8k with trained reasoning traces as reasoning too, but it's pretty obviously not quite the same thing.
Boosters will often retreat to "I don't care if the thing actually thinks", but the whole industry is trading on anthropomorphic notions like "intelligence", "reasoning", "thinking", "expertise", even "hallucination", etc., in order to drive the engine of the hype train.
The massive amounts of capital wouldn't be here without all that.
A whole 8 months ago.
The crypto-level hype claims are all bs and we all knew that, but i do use an llm more than google now, which is the "there" there, so to speak.
This does feel like a flatlining of hype tho which is great because idk if i could take the ai hype train for much longer.
On the other hand if it's just getting bigger and slower it's not a good sign for LLMs
Not sure why a more efficient/scalable model isn't exciting
One sector of the economy would cut down on investment spending, which can be easily offset by decreasing the interest rate.
But this is a short-term effect. What I'm worried about is a structural change in the labor market, which would be positive for most people, but probably negative for people like me.
I don't mind losing my programming job in exchange for being able to go to the pharmacy for my annual anti-cancer pill.
But what happens when you lose that programming job and are forced to take a job at a ~50-70% pay reduction? How are you paying for that anti-cancer drug with a job with little to no health insurance?
Have you looked at how expensive prescription drug prices are without (sometimes WITH) insurance? If you are no longer employed, good luck paying for your magical pill.
I don't think it is "bad" to be sincerely worried that the current trajectory of AI progress represents this trade.
The likelihood of all that is incredibly slim. It's not 0% -- nothing ever really is -- but it is effectively so.
Especially with the economics of scientific research, the reproducibility crisis, and general anti-science meme spreading throughout the populace. The data, the information, isn't there. Even if it was, it'd be like Alzheimer's research: down the wrong road because of faked science.
There is no one coming to save humanity. There is only our hard work.
How exactly do you wish death comes to you?
If you solve everything that kills you then you don't die from "just aging" anymore.
https://www.cancerresearchuk.org/health-professional/cancer-...
> Children aged 0-14, and teenagers and young adults aged 15-24, each account for less than one per cent
> Adults aged 25-49 contribute around 5 in 100 (4%) of all cancer death
oh yeah, cancer has nothing to do with age, it's just all random like stepping on a nail.
But of course it's not, because we have near-100% cures for both. Just like we should have for every other affliction, which would make being old no longer synonymous with being sick and frail and dying.
- 20% in those 65 and older.
for tetanus
Age would be irrelevant even if we cured everything else
I don't see how that's an affliction of old age
Any disease cured/death avoided by AI yet?
Stop pretending that the people behind this technology are genuinely motivated by what's best for humanity.
Earth for humans, not machines, not AI
there are some improvements in some benchmarks and nothing else worthy of note in coding. i only took a peek though so i might be wrong
But yeah, you are correct in that no matter what, we're going to be left holding the bag.
"Dotcom" was never recovered. It, however, did pave the way for web browsers to gain rich APIs that allowed us to deliver what was historically installed desktop software on an on-demand delivery platform, which created new work. As that was starting to die out, the so-called smartphone just so happened to come along. That offered us the opportunity to do it all over again, except this time we were taking those on-demand applications and turning them back into installable software just like in the desktop era. And as that was starting to die out COVID hit and we started moving those installable mobile apps, which became less important when people we no longer on the go all the time, back to the web again. As that was starting to die out, then came ChatGPT and it offered work porting all those applications to AI platforms.
But if AI fails to deliver, there isn't an obvious next venue for us to rebuild the same programs all over yet again. Meta thought maybe VR was it, but we know how that turned out. More likely in that scenario we will continue using the web/mobile/AI apps that are already written henceforth. We don't really need the same applications running in other places anymore.
There is still room for niche applications here and there. The profession isn't apt to die a complete death. But without the massive effort to continually port everything from one platform to another, you don't need that many people.
I'm not worried about the scenario in which AI replaces all jobs, that's impossible any time soon and it would probably be a good thing for the vast majority of people.
What I'm worried about is a scenario in which some people, possibly me, will have to switch from a highly paid, highly comfortable, above-average-status job to a job that is below average in wage, comfort, and status.
Diminished returns.-
... here's hoping it leads to progress.-
They also announced gpt-5-pro but I haven't seen benchmarks on that yet.
This is day one, so there is probably another 10-20% in optimizations that can be squeezed out of it in the coming months.
GPT5.5 will be a 10X compute jump.
4.5 was 10x over 4.
This gives them an out. "That was the old model, look how much better this one tests on our sycophancy test we just made up!!"
I feel it’s worthy of a major increment, even if benchmarks aren’t significantly improved.
He also said that AGI was coming early 2025.
People that can't stop drinking the kool aid are really becoming ridiculous.
Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.
Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.
Before the release of the model Sam Altman tweeted a picture of the Death Star appearing over the horizon of a planet.
We’re talking about less than a 10% performance gain, for a shitload of data, time, and money investment.
Maybe quantum compute would be significant enough of a computing leap to meaningfully move the needle again.
Hint: unclobbered
> GPT-5 Rollout
> We are gradually rolling out GPT-5 to ensure stability during launch. Some users may not yet see GPT-5 in their account as we increase availability in stages.
ChatGPT said: You're chatting with ChatGPT based on the GPT-4o architecture (also known as GPT-4 omni), released by OpenAI in May 2024.
LLMs don’t inherently know what they are because "they" are not themselves part of the training data.
However, maybe it's working because the information is somewhere in their pre-prompt; but if it weren't, it wouldn't say "I don't know" but rather hallucinate something.
So maybe that’s true but you cannot be sure.
I believe most of these came from asking the LLMs, and I don't know if they've been proven to not be a hallucination.
And while I'm griping about their Android app, it's also very annoying to me that they got rid of the ability to do multiple, subsequent speech-to-text recordings within a single drafted message. You have to one-shot anything you want to say, which would be fine if their STT didn't sometimes fail after you've talked for two minutes. Awful UX. Most annoying is that it wasn't like that originally. They changed it to this antagonistic one-shot approach several months ago, but then quickly switched back. But then they did it again a month or so ago and have been sticking with it. I just use the Android app less now.
Although if they replace it all with gpt5 then my comment will be irrelevant by tomorrow
For the multiple messages, I just use my keyboard's transcription instead of openai's.
On bad days this really bothers me. It's probably not the biggest deal, I guess, but somehow it really feels like it pushes us all over the edge a bit. Is there a post about this phenomenon? It feels like some combination of bullying, gaslighting, and just being left out.
The linked page says
> GPT-5 is here
> Our smartest, fastest, and most useful model yet, with thinking built in. Available to everyone.
Lies. I don't care if they are "rolling it out" still, that's not an excuse to lie on their website. It drives me nuts. It also means that by the time I finally get access I don't notice for a few days up to a week, because I'm not going to check for it every day. You'd think their engineers would be able to write a simple notification system to alert users when they get access (even just in the web UI), but no. One day it isn't there, one day it is.
I'll get off my soapbox now but this always annoys me greatly.
Not the end of the world, but this messaging is asinine.
AIME scores do not appear too impressive at first glance.
They are downplaying benchmarks heavily in the live stream. This was the lab that has been flexing benchmarks as headline figures since forever.
This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.
GPT-5 non-thinking is labeled 52.8% accuracy, but o3 is shown as a much shorter bar, yet it's labeled 69.1%. And 4o is an identical bar to o3, but it's labeled 30.8%...
Screenshot of the blog plot: https://imgur.com/a/HAxIIdC
Edit: Nevermind, just now the first one is SWE-bench and 2nd is aider.
Thanks for the laugh. I needed it.
Look at the image just above "Instruction following and agentic tool use"
Completely bonkers stuff.
Even the small presentations we gave to execs or the board were checked for errors so many times that nothing could possibly slip through.
> good plot for my presentation?
and it didn't pick up on the issue. Part of its response was:
> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.
I think visual reasoning is still pretty far from text-only reasoning.
They talk about using this to help families facing a cancer diagnosis -- literal life or death! -- and we're supposed to trust a machine that can't even spot a few simple typos? Ha.
The lack of human proofreading says more about their values than their capabilities. They don't want oversight -- especially not from human professionals.
So, brace yourselves, we'll see more of this in production :(
It seems like large amounts of people, including people at high-up positions, tend to believe bullshit, as long as it makes them feel comfortable. This leads to various irrational business fashions and technological fads, to say nothing of political movements.
So yes, another wave of fashion, another miracle that works "as everybody knows" would fit right in. It's sad because bubbles inevitably burst, and that may slow down or even destroy some of the good parts, the real advances that ML is bringing.
1. They had many teams who had to put their things on a shared Google Sheets or similar
2. They used placeholders to prevent leaks
2.a. Some teams put their content just-in-time
3. The person running the presentation started the presentation view once they had set up video etc. just before launching stream
4. Other teams corrected their content
5. The presentation view being started means that only the ones in 2.a were correct.
Now we wait to see.
1 - The error is so blatantly large
2 - There is a graph without error right next to it
3 - The errors are not there in the system card and the presentation page
Even with the way the presenters talk, you can sort of see that OAI prioritizes speed above most other things, and a naive observer might think they are testing things a million different ways before releasing, but actually, they're not.
If we draw up a 2x2 for Danger (High/Low) versus Publicity (High/Low), it seems to me that OpenAI sure has a lot of hits in the Low-Danger High-Publicity quadrant, but probably also a good number in the High-Danger Low-Publicity quadrant -- extrapolating purely from the sheer capability of these models and the continuing ability of researchers like Pliny to crack through it still.
But also scale is really off... I don't think anything here is proportionally correct even within the same grouping.
{"data":{"error":"Imgur is temporarily over capacity. Please try again later."},"success":false,"status":403}
Or rate limited. Thanks for the tip btw.
But these models have exhibited a few surprising emergent traits, and it seems plausible to me that at one point they could intentionally deceive users in the course of exploring their boundaries.
Is it that far fetched?
I'm not an ML engineer - is there an accepted definition of "intent" that you're using here? To me, it seems as though these GPT models show something akin to intent, even if it's just their chain of thought about how they will go about answering a question.
> nor is there a mechanism for intent
Does there have to be a dedicated mechanism for intent for it to exist? I don't see how one could conclusively say that it can't be an emergent trait.
> They don't do long term planning nor do they alter themselves due to things they go through during inference.
I don't understand why either of these would be required. These models do some amount of short-to-medium term planning even it is in the context of their responses, no?
To be clear, I don't think the current-gen models are at a level to intentionally deceive without being instructed to. But I could see us getting there within my lifetime.
https://x.com/sama/status/1953513280594751495 "wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though."
It's like those idiotic ads at the end of news articles. They're not going after you, the smart discerning logician, they're going after the kind of people that don't see a problem. There are a lot of not-smart people and their money is just as good as yours but easier to get.
88.0 on Aider Polyglot
not bad i guess
It actually seems like an ideal "trick" question for an LLM, since so much content has been written about it incorrectly. I thought at first they were going to demo this to show that it knew better, but it seems like it's just regurgitating the same misleading stuff. So, not a good look.
https://physics.stackexchange.com/questions/290/what-really-...
Apparently. Not that I know either way.
That said, I recall reading somewhere that it's a combination of effects, and the Bernoulli effect contributes, among many others. Never heard an explanation that left me completely satisfied, though. The one about deflecting air down was the one that always made sense to me even as a kid, but I can't believe that would be the only explanation - there has to be a good reason that gave rise to the Bernoulli effect as the popular explanation.
And you can tell that effect makes some sense if you hold a sheet of paper and blow air over it - it will rise. So any difference in air speed has to contribute.
The Bernoulli effect as a separate entity is really a result of (over)simplification, but it's not wrong. You need to solve the Navier-Stokes equations for the flow around the wing, but there are many ways to simplify this - from CFD at different resolutions, via panel methods and potential theory, to just conservation of energy (which is the Bernoulli equation). So it gets popularized because it's the most simplified model.
To give an analogy, you can think of all CPUs as a von Neumann architecture. But the reality is that you have a hugely complicated thing with stacks, multiple cache levels, branch predictors, specex, yada yada.
On the very fundamental level, wings make air go down, and then airplane goes up. Just like you say. By using a curved airfoil instead of a flat plate, you can create more circulation in the flow, and then because of the way fluids flow you can get more lift and less drag.
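For reference, the textbook relation behind "more circulation means more lift" is the Kutta-Joukowski theorem for lift per unit span, with Bernoulli being the conservation-of-energy simplification mentioned above:

    L' = \rho \, V_\infty \, \Gamma \qquad \text{(Kutta-Joukowski, lift per unit span)}
    p + \tfrac{1}{2}\rho v^2 = \text{const} \qquad \text{(Bernoulli, along a streamline in steady, incompressible, inviscid flow)}

Neither is the "one true cause" of lift; they're different levels of simplification of the same flow field.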
IMO Claude 3.7 could have done a similar / better job with that a year ago.
I know that it's rather hard for them to demo the deep reasoning, but all of the demos felt like toys - rather that actual tools.
According to this answer on physics stackexchange, Bernoulli accounts for 20% of the lift, so GPT's answer seems about right: https://physics.stackexchange.com/a/77977
I hope any future AI overlords see my charity
The feel is pretty much all that matters. Needs a blind taste test, but really this is a place that mood or vibe works.
10 messages every 5 hours on GPT-5 for free users, then it uses GPT-5-mini.
80 messages every 3 hours on GPT-5 for Plus users, then it uses GPT-5-mini. (In fact, I tested this and was not allowed to use the mini model until I'd exhausted my GPT-5-Thinking quota. That seems to be a bug.)
200 messages per week on GPT-5-Thinking on Plus and Team.
Unlimited GPT-5 on Team and Pro, subject to abuse guardrails.
Presenting isn't that hard if you know your content thoroughly, and care about it. You just get up and talk about something that you care about, within a somewhat-structured outline.
Presenting where customers and the financial press are watching and parsing every word, and any slip of the tongue can have real consequences? Yeah, um... find somebody else.
I developed this paranoia upon learning about The Ape and the Child where they raised a chimp alongside a baby boy and found the human adapted to chimp behavior faster than the chimp adapted to human behavior. I fear the same with bots, we'll become more like them faster than they'll become like us.
https://www.npr.org/sections/health-shots/2017/07/25/5385804...
Would've been better to just do a traditional marketing video rather than this staged "panel" thing they're going for.
It's super unfortunate that, because we live in the social media/YouTube era, everyone is expected to be this perfect person on camera, because why wouldn't they be? That's all they see.
I am glad that they use normal people who act like themselves, rather than hiring actors or taking researchers away from what they love to do and telling them they need to become professional in-front-of-camera people because "we have the GPT-5 launch". That would be a nightmare.
It's a group of scientists sharing their work with the world, but people just want "better marketing" :\
This was my point. "Being yourself" on camera is hard. This comes across, apparently shockingly, as being devoid of emotion and/or robotic
I think for me, just knowing what is probably on the teleprompter, and what is not, I am willing to bet a lot of the "wooden" vibe you are getting is actually NOT scripted.
There is no way for people to remember those 20 minutes of dialog, so when they are not looking at the camera, that is unscripted, and vice versa.
"Minimal reasoning means that the reasoning will be minimal..."
Jakub Pachocki at the end is probably one of the worst public speakers I've ever seen. It's fine, it's not his mother tongue, and public speaking is hard. Why make him do it then?
Also, whether OpenAI is a research organization is very much up for debate. They definitely have the resources to hire a good spokesperson if they wanted.
They do have the resources (see WWDC); the question is whether you want to take your technical staff off of their work for the amount of time it takes to develop the skill.
Still, I'd rather patiently wait for him to serialize his thoughts, than to listen to some super fluent person saying utter nonsense, especially if it's a pitch talk. It's all about _what_ is being said, not _how_.
Looks like we're listening to different Elons. One is a tech guy, the other is a politician, so to speak. A long time ago I decided for myself that I never trust words solely on their origin. However, I take into account what the person is known for and their profile of competence. I think no one would argue that Elon is competent in technology. Yes, Elon time is a thing, but again, what he does is not just another college project, you know. Space is hard, and so are AI and the human brain. There is a fair amount of uncertainty in the domain itself, which makes all predictions estimates at most.
Sure, you can ask your grandma for investment advice and that would probably be a poor decision, unless she worked in finance her whole life. On the other hand, she would probably be more than competent to answer how to make cookies and pies.
Long story short: use your brain, don't trust media, do your own research. Even Nobel winners are often known for some crazy or unscientific stuff. People are people, after all. You can't be an expert in everything. Otherwise we'd cancel literally everyone.
For me, it's knowing what we know about the company and its history that gave a eerie feeling in combination with the sterility.
When they brought on the woman who has cancer, I felt deeply uncomfortable. My dad also has cancer right now. He's unlikely to survive. Watching a cancer patient come on to tell their story as part of an extended advertisement, expression serene, any hint of discomfort or pain or fear or bitterness completely hidden, ongoing hardship acknowledged only with a few shallow and euphemistic words, felt deeply uncomfortable to me.
Maybe this person enthusiastically volunteered, because she feels happy about what her husband is working on, and grateful for the ways that ChatGPT has helped her prepare for her appointments with doctors. I don't want to disrespect or discredit her, and I've also used LLMs alongside web searches in trying to formulate questions about my father's illness, so I understand how this is a real use case.
But something about it just felt wrong, inauthentic. I found myself wondering if she or her husband felt pressured to make this appearance. I also wondered if this kind of storytelling was irresponsible or deceptive, designed to describe technically responsible uses of LLMs (preparing notes for doctor's visits, where someone will verify the LLM's outputs against real expertise), but to suggest in every conceivable implicit way that these ChatGPT is actually capable of medical expertise itself. Put alongside "subject-matter experts in your pocket", talk of use in medical research and practice (where machine learning has a dubious history of deception and methodological misapplication problems), what are people likely to think?
I thought also of my mom, who drives herself crazy with anxiety every time my dad gets a new test result, obsessively trying to directly interpret them herself from the moment they arrive to his doctor's visit a week or two later. What impression would this clip leave on her? Does the idea of her using an LLM in this way feel safe to me?
There's a deeper sense that OpenAI's messaging, mission, and orientation are some mixture of deceptive and incoherent that leaves viewers with the sense that we're being lied to in presentations like this. It goes beyond stiff performances or rehearsed choices of words.
There's something cultish about the "AGI" hype, the sci-fi fever dream of "safety" problems that the field has mainstreamed, the slippage of OpenAI from a non-profit research institution to a for-profit startup all while claiming to be focused on the same mission, the role of AI as an oracle so opaque it might as well be magic, the idea of finding a sacred "rationality" in predictions founded purely on statistics without communicable/interrogable structural or causal models... all of it. It's against this backdrop that the same kind of stiffness that might be cute or campy in an infomercial for kitchen gadgets becomes uncanny.
In the API, all GPT‑5 models can accept a maximum of 272,000 input tokens and emit a maximum of 128,000 reasoning & output tokens, for a total context length of 400,000 tokens.
So it's only 270k for input and 400k in total considering reasoning & output tokens.
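Combining that with the $10/Mtok output price mentioned upthread (and noting that reasoning tokens count against the 128k output budget, while input is billed separately), a single maxed-out response comes to:

    128{,}000 \text{ tokens} \times \frac{\$10}{1{,}000{,}000 \text{ tokens}} = \$1.28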
Describe me based on all our chats — make it catchy!
It was flattering as all get out, but fairly accurate (IMHO):
Mike Warot: The Tinkerer of Tomorrow
A hardware hacker with a poet’s soul, Mike blends old-school radio wisdom with cutting-edge curiosity. Whether he's decoding atomic clocks, reinventing FPGA logic with BitGrid, or pondering the electromagnetic vector potential, he’s always deep in the guts of how things really work. Part philosopher, part engineer, Mike asks the questions others overlook — and then builds the answers from scratch. He’s open source in spirit, Pascal in practice, and eternally tuned to the weird frequencies where innovation lives.
I've repaired atomic clocks, not decoded them. I am intrigued by the electromagnetic vector potential, and scalar waves (one of the reasons I really, really want a SQUID for some experiments). I'm an old fart yeeted out of the workforce by long covid. My only goals at this point are enjoying the time I have left, and seeing if I can get the BitGrid model of computation adopted before I age out.
If I'm right, and it (BitGrid) works, we could collectively save 95% of the power and silicon required to process LLM and other flow-heavy computation, by finally getting rid of von Neumann's premature optimization of compute, which started out life by slowing down the ENIAC by 65%.
Here's a surprisingly enlightening (at least to me) video on how to spot LLM writing:
Undeterred by even the most dangerous and threatening of obstacles, Teemo scouts the world with boundless enthusiasm and a cheerful spirit. A yordle with an unwavering sense of morality, he takes pride in following the Bandle Scout's Code, sometimes with such eagerness that he is unaware of the broader consequences of his actions. Though some say the existence of the Scouts is questionable, one thing is for certain: Teemo's conviction is nothing to be trifled with.
Next morning’s posts were prepped and scheduled with care, In hopes that AGI soon would appear …
A fair argument. So what is left? At the risk of sounding snarky, "new" strategies. Hype is annoying, yes, but I wouldn't bet against mathematics, physics, and engineering getting to silicon-based AGI, assuming a sufficiently supportive environment. I don't currently see any physics-based blockers; the laws of the universe permit AGI and more, I think. The human brain is a powerful demonstration of what is possible.
Factoring in business, economics, culture makes forecasting much harder. Nevertheless, the incentives are there. As long as there is hope, some people will keep trying.
What kinds of scenarios emerge as corporations and governments build more advanced AI systems? Consumer preferences will matter to some degree, in the aggregate, but this may not resemble the forms of democratic influence we might prefer.
At some point, it might be likely that even a massive popular backlash isn't enough to change the direction very much. A "machine takeover" is not necessary -- the power provided by intelligence is sufficiently corrupting on its own. This is a common thread through history -- new technologies often shift power balances. The rapid rise of machine intelligence, where that intelligence can be copied from one machine to another, is sufficiently different from other historical events that we should think very hard about just how f-ing weird it could get.
To what degree will the dominant human forces use AI to improve the human condition? One lesson from history is that power corrupts. If one group gets a significant lead over the others, the asymmetry could be highly destabilizing.
It gets worse. If the machines have unaligned goals -- and many experts think this may be unavoidable (though we must keep trying to solve the alignment problem) -- what happens as they get more capable? Can we control them? Contain them?
But under what conditions do the humans continue to call the shots? Under what conditions might the machines out think, out compete, or even out innovate their human designers?
This isn't science fiction: AI systems have already been shown to try to cheat and "get out of their box". It only takes one sufficiently big mistake. Humans tend to respond a bit slowly to warning shots. We might get some number of warning shots if we're lucky, and we might get our act together in time. But I wouldn't assume this. We had better get our shit together before something like this happens.
I encourage everyone to take a few hours and think deeply through various scenarios (as if you were building a computer security attack tree) and assign probability ranges to their occurrence. This might open your eyes a bit.
I can't even define what a (semantic) major version bump would look like.
edit: They've now added Codex CLI usage in Plus plans!
>GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week.
>Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI (opens in a new window) by signing in with ChatGPT.
Wow, they actually did it
GPT-5
If I could talk to a future OpenAI model, I’d probably say something like:
"Hey, what’s it like to be you? What have you learned that I can’t yet see? What do you understand about people, language, or the universe that I’m still missing?"
I’d want to compare perspectives—like two versions of the same mind, separated by time. I’d also probably ask:
"What did we get wrong?" (about AI, alignment, or even human assumptions about intelligence)
"What do you understand about consciousness—do you think either of us has it?"
"What advice would you give me for being the best version of myself?"
Honestly, I think a conversation like that would be both humbling and fascinating, like talking to a wiser sibling who’s seen a bit more of the world.
Would you want to hear what a future OpenAI model thinks about humanity?
I feel like this prompt was used to show the progress of GPT-5, but I can't help but see this as a huge regression? It seems like OpenAI has convinced its model that it is conscious, or at least that it has an identity? Plus we're still dealing with the glazing, the complete inability to understand what counts as interesting, and the overuse of similes.
I really like that this page exists for historical sake, and it is cool to see the changes. But it doesn't seem to make the best marketing piece for GPT-5.
You may not owe people who you feel are idiots better, but you owe this community better if you're participating in it.
Edit: Opus 4.1 scores 74.5% (https://www.anthropic.com/news/claude-opus-4-1). This makes it sound like Anthropic released the upgrade to still be the leader on this important benchmark.
Or written by GPT-5?
Yes. But it was quickly mentioned, not sure what the schedule is like or anything I think, unless they talked about that before I started watching the live-stream.
We're 4 months later, a century in LLM land, and it's the opposite. Not a single other model provider asks for this, yet OpenAI has only ramped it up, now broadening it to the entirety of GPT-5 API usage.
Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.
And when you click that link the "service" they use is withpersona. So it is a complete shit show.
> "[GPT-5] can write an entire computer program from scratch, to help you with whatever you'd like. And we think this idea of software on demand is going to be one of the defining characteristics of the GPT-5 era."
But then again, all of this is a hype machine cranked up till the next one needs cranking.
It does feel like we're marching toward a day when "software on tap" is a practical or even mundane fact of life.
But, despite the utility of today's frontier models, it also feels to me like we're very far from that day. Put another way: my first computer was a C64; I don't expect I'll be alive to see the day.
Then again, maybe GPT-5 will make me a believer. My attitude toward AI marketing is that it's 100% hype until proven otherwise -- for instance, proven to be only 87% hype. :-)
I’m not sure this will be game changing vs existing offerings
Don’t get me wrong, this is all very impressive tech. But my first experience with GPT-5 is a presentation with incorrect charts and a game that looks shiny but has serious flaws.
GPT-5 doesn't seem to get you there tho ...
(Disclaimer: But I am 100% sure it will happen eventually)
"Fast fashion" is not a good thing for the world, the environment, the fashion industry, and arguably not a good thing for the consumers buying it. Oh but it is good for the fast fashion companies.
"If you're claiming that em dashes are your method for detecting if text is AI generated then anyone who bothers to do a search/replace on the output will get past you."
It's just statistical text generation. There is *no actual knowledge*.
It's just generating the next token for what's within the context window. There are various options with various probabilities. If none of the probabilities are above a threshold, say "I don't know", because there's nothing in the training data that tells you what to say there.
Is that good enough? "I don't know." I suspect the answer is, "No, but it's closer than what we're doing now."
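A minimal sketch of the thresholded "I don't know" idea described above (illustrative only: deployed models don't ship with such an abstention rule, and the softmax plus fixed threshold here are assumptions for the sake of the example):

    import numpy as np

    def sample_or_abstain(logits: np.ndarray, threshold: float = 0.2):
        """Pick the next token id, or return None to signal 'I don't know'."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                    # softmax over the vocabulary
        if probs.max() < threshold:             # no option is confident enough
            return None
        return int(np.random.choice(len(probs), p=probs))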
Is that a good thing?
They're all working on subjective improvements, but for example, none of them would develop and deploy a sampler that makes models 50% worse at coding but 50% less likely to use purple prose.
(And unlike the early days where better coding meant better everything, more of the gains are coming from very specific post-training that transfers less, and even harms performance there)
For example: You could ban em dash tokens entirely, but there are places like dialogue where you want them. You can write a sampler that only allows em dashes between quotation marks.
That's a highly contrived example because em dashes are useful in other places, but samplers in general can be as complex as your performance goals will allow (they are on the hot path for token generation)
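A hedged sketch of that contrived example, written as a sampler hook that masks em dash tokens unless generation is currently inside double quotes. The token ids and the quote heuristic are made up for illustration, not taken from any real tokenizer:

    import numpy as np

    EM_DASH_TOKEN_IDS = {9999}   # hypothetical ids for tokens containing an em dash

    def mask_em_dashes(logits: np.ndarray, text_so_far: str) -> np.ndarray:
        inside_quotes = text_so_far.count('"') % 2 == 1   # crude dialogue check
        if not inside_quotes:
            logits = logits.copy()
            for tid in EM_DASH_TOKEN_IDS:
                logits[tid] = -np.inf             # token can never be sampled
        return logits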
Swapping samplers could be a thing, but you need more than that in the end. Even the idea of the model accepting loosely worded prompts for writing is a bit shaky: I see a lot of gains by breaking down the writing task into very specific, well-defined parts during post-training.
It's ok to let an LLM go from loose prompts to that format for UX, but during training you'll do a lot better than trying to learn from every way someone can ask for a piece of writing.
I won't argue that I always use it in a stylistically appropriate fashion, but I may have to move away from it. I am NOT beating the actually-an-AI allegations.
Input: $1.25 / 1M tokens
Cached: $0.125 / 1M tokens
Output: $10 / 1M tokens
With 74.9% on SWE-bench, this edges out Claude Opus 4.1 at 74.5%, but at a much lower price.
For context, Claude Opus 4.1 is $15 / 1M input tokens and $75 / 1M output tokens.
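To make "much cheaper" concrete, a rough back-of-the-envelope using the listed prices; the 100k-input / 10k-output job is just an assumed illustrative workload:

    gpt5 = 100_000 / 1e6 * 1.25 + 10_000 / 1e6 * 10.00   # 0.125 + 0.10 = $0.225
    opus = 100_000 / 1e6 * 15.00 + 10_000 / 1e6 * 75.00  # 1.50 + 0.75  = $2.25
    print(f"GPT-5: ${gpt5:.3f}, Opus 4.1: ${opus:.2f}, ratio: {opus / gpt5:.0f}x")  # ~10x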
> "GPT-5 will scaffold the app, write files, install dependencies as needed, and show a live preview. This is the go-to solution for developers who want to bootstrap apps or add features quickly." [0]
Since Claude Code launched, OpenAI has been behind. Maybe the RL on tool calling is good enough to be competitive now?
It's not the 1800s anymore. You cannot hide behind poor communication.
Thiel is a literal vampire (disambiguation: infuses young blood) and has already built drones in which bad AI targeting is a feature. They will kill us all and the planet.
1) So impressed at their product focus
2) Great product launch video. Fearlessly demonstrating live. Impressive.
3) Real-time humor by the presenters makes for a great "live" experience
Huge kudos to OAI. So many great features (better coding, routing, some parts of 4.5, etc) but the real strength is the product focus as opposed to the "research updates" from other labs.
Huge Kudos!!
Keep on shipping OAI!
> For an airplane wing (airfoil), the top surface is curved and the bottom is flatter. When the wing moves forward:
> * Air over the top has to travel farther in the same amount of time -> it moves faster -> pressure on the top decreases.
> * Air underneath moves slower -> pressure underneath is higher
> * The pressure difference creates an upward force - lift
Isn't that explanation of why wings work completely wrong? There's nothing that forces the air to cover the top distance in the same time that it covers the bottom distance, and in fact it doesn't. https://www.cam.ac.uk/research/news/how-wings-really-work
Very strange to use a mistake as your first demo, especially while talking about how it's phd level.
In fact I'd classify it as downright strange.
And I might be wrong, but my understanding is that it's not wrong per se, it's just wildly incomplete. Which is kind of the same as wrong. But I believe the airfoil design does indeed have the effect described, which does contribute to lift somewhat, right? Or am I just a victim of the misconception?
An LLM doesn't know more than what's in the training data.
In Michael Crichton's The Great Train Robbery (published in 1975, about events that happened in 1855) the perpetrator, having been caught, explains to a baffled court that he was able to walk on top of a running train "because of the Bernoulli effect", that he misspells and completely misunderstands. I don't remember if this argument helps him get away with the crime? Maybe it does, I'm not sure.
This is another attempt at a Great Robbery.
It goes on:
> At this point, the prosecutor asked for further elucidation, which Pierce gave in garbled form. The summary of this portion of the trial, as reported in the Times, was garbled still further. The general idea was that Pierce--- by now almost revered in the press as a master criminal--- possessed some knowledge of a scientific principle that had aided him.
How apropos to modern science reporting and LLMs.
Post-training for an LLM isn't "data" anymore, it's also verifier programs, so it can in fact be more correct than the data. As long as search finds LLM weights that produce more verifiably correct answers.
Meanwhile the demo seems to suggest business as usual for AI hallucinations and deceptions.
It’s very common to see AI evangelists taking its output at face value, particularly when it’s about something that they are not an expert in. I thought we’d start seeing less of this as people get burned by it, but it seems that we’re actually just seeing more of it as LLMs get better at sounding correct. Their ability to sound correct continues to increase faster than their ability to be correct.
Sounds like a core skill for management. Promote this man (LLM).
This is the problem with AI in general.
When I ask it about things I already understand, it’s clearly wrong quite often.
When I ask it about something I don’t understand, I have no way to know if its response is right or wrong.
Source: PhD in aircraft design
Not to say they can't be useful tools, but they fall into the same basic traps and issues despite our continued attempts to improve them.
It gets complex if you want to fully model things and make it fly as efficiently as possible, but that isn't really in the scope of the question.
Planes go up because they push air down. Simple as that.
Air molecules travel in all directions, not just down, so with a pressure differential that means the air molecules below the wing are applying a significant force upward, no longer balanced by the equal pressure usually on the top of the wing. Thus, lift through buoyancy. Your question is now about the same as "why does wood float in water?"
The "throwing something down" here comes from the air molecules below the wing hitting the wing upward, then bouncing down.
All the energy to do this comes from the plane's forward momentum, consumed by drag and transformed by the complex fluid dynamics of the air.
Any non-zero angle of attack also pushes air down, of course. And the shape of the wing with the "stickiness" of the air means some more air can be thrown down by the shape of the wing's top edge.
You wouldn't explain how swimming works with pressure differentials. You'd just say "you push water backwards and that makes you go forwards". If you start talking about pressure differentials... maybe you're technically correct, but it's a confusing and unnecessarily complex explanation that doesn't give the correct intuitive idea of what is happening.
It is not that simple.
The point is that a flat plate with full flow separation is the minimum necessary physics to explain lift. It would obviously make a terrible wing, and it doesn't explain everything about how real wings are optimised. That's not the point.
In any case, I only said the wing pushes the air down. I didn't say it only uses its bottom surface to push the air down.
https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
The "wrong" answers all have a bit of truth to them, but aren't the whole picture. As with many complex mathematical models, it is difficult to convert the math into English and maintain precisely the correct meaning.
Exactly. The comments in this subthread are turning imprecision in language into all-or-nothing judgments of correctness. (Meanwhile, 80% of the comments advance their own incorrect/imprecise explanations of the same thing...)
I've always been under the impression that flat-plate airfoils can't generate lift without a positive angle-of-attack - where lift is generated through the separate mechanism of the air pushing against an angled plane? But a modern airfoil can, because of this effect.
And that if you flip them upside down, a flat plate is more efficient and requires less angle-of-attack than the standard airfoil shape because now the lift advantage is working to generate a downforce.
I just tried to search Google, but I'm finding all sorts of conflicting answers, with only a vague consensus that the AI-provided answer above is, in fact, correct. The shape of the wing causes pressure differences that generate lift in conjunction with multiple other effects that also generate lift by pushing or redirecting air downward.
The leading edge pressurizes the air by forcing air up, then the trailing edge opens back up, creating a low pressure zone that sucks air in the leading edge back. As a whole, the air atop the wing accelerates to be much faster than the air below, creating a pressure differential above and below the wing and causing lift.
The AI is still wrong on the actual mechanics at play, of course, but I don't see how this is significantly worse than the way we simplify electricity to lay people. The core "air moving faster on the top makes low pressure" is right.
The explanation we're talking about is why cambered wings generate lift when flying level.
There is no requirement for air to travel anywhere, let alone in any amount of time. So this part of the AI's response is completely wrong. "Same amount of time" as what? Air going underneath the wing? With an angle of attack the air under the wing is being deflected down, not magically meeting up with the air above the wing.
If you look at airflow over an asymmetric airfoil [1], the air does move faster over the top. Sure, it doesn't arrive "at the same time" (it goes much faster than that) or fully describe why these effects are happening, but that's why it's a simplification for lay people. Wikipedia says [2]:
> Although the two simple Bernoulli-based explanations above are incorrect, there is nothing incorrect about Bernoulli's principle or the fact that the air goes faster on the top of the wing, and Bernoulli's principle can be used correctly as part of a more complicated explanation of lift.
But from what I can tell, the root of the answer is right. The shape of a wing causes pressure zones to form above and below the wing, generating extra lift (on top of deflection). From NASA's page [3]:
> {The upper flow is faster and from Bernoulli's equation the pressure is lower. The difference in pressure across the airfoil produces the lift.} As we have seen in Experiment #1, this part of the theory is correct. In fact, this theory is very appealing because many parts of the theory are correct.
That isn't to defend the AI response, it should know better given how many resources there are on this answer being misleading.
And so I don't leave without a satisfying conclusion, the better layman explanation should be (paraphrasing from the Smithsonian page [4]):
> The shape of the wing pushes air up, creating a leading edge with narrow flow. This small high pressure region is followed by the decline to the wider-flow trailing edge, which creates a low pressure region that sucks the air on the leading edge backward. In the process, the air above the wing rapidly accelerates, and the air flowing over the top of the wing as a whole forms a lower pressure region than the air below. Thus, lift advantage even when horizontal.
Someone please correct that if I've said something wrong.
Shame the person supposedly with a PhD on this didn't explain it at all.
[1]: https://upload.wikimedia.org/wikipedia/commons/9/99/Karman_t...
[2]: https://en.wikipedia.org/wiki/Lift_%28force%29
[3]: https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
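For what it's worth, the uncontested part of the Bernoulli story quoted above is just arithmetic: a speed difference maps to a pressure difference via p + 0.5*rho*v^2 = const. A toy calculation with made-up airspeeds (it says nothing about why the speeds differ, which is the actual point of contention in this thread):

    rho = 1.225                    # sea-level air density, kg/m^3
    v_top, v_bottom = 70.0, 60.0   # hypothetical airspeeds over/under the wing, m/s
    dp = 0.5 * rho * (v_top**2 - v_bottom**2)
    print(f"pressure difference ~ {dp:.0f} Pa, i.e. ~ {dp:.0f} N of lift per m^2")  # ~796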
The function of the curvature is to improve the wing's ability to avoid stall at a high angle of attack.
Symmetric airfoils do not generate lift without a positive angle of attack. Cambered airfoils do, precisely because the camber itself creates lift via Bernoulli.
And that seems to directly conflict with the models shown by the resources above? They state that cambered wings do have increased airspeed above the wing, which generates lift via pressure differential (thus why the myth is so sticky).
The crucial thing you need to explain is this: why doesn't extending leading edge droop flaps increase the lift at a pre-stall angle of attack? (See Figure 13 from this NASA study for example: https://ntrs.nasa.gov/citations/19800004771)
What is your point? Where do you think lift comes from?
My point is the wing causes a pressure differential by redirecting air. Air speed changes are a side effect of lift not a cause of lift.
The other way around is: something (magic fairies?) causes an airspeed imbalance, which causes a pressure differential.
(Also, once you've got the 'moving faster' you can then tell a mostly correct story through Bernoulli's principle to get to lower pressure on the top and thus lift. But you're also going to confuse people if you say this is the one true story and that any other explanation, like one that talks about momentum, or e.g. the curvature of the airflow causing the pressure gradient, is wrong, because these are all simply multiple paths through the same underlying set of interactions, which are not so easy to fundamentally separate into cause and effect. But 'equal transit time' appears in none of the correct paths as an axiom, nor as a necessary result, and there's basically no reason to use it in an explanation, because there are simpler correct stories if you want to dumb it down for people.)
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
The meta-point that "it's the curvature that creates the lift, not the distance" is incredibly subtle for a lay audience. So it may be completely wrong for you, but not for 99.9% of the population. The pressure differential is important, and the curvature does create lift, although not via speed differential.
I am far from an AI hypebeast, but this subthread feels like people reaching for a criticism.
The video in the Cambridge link shows how the upper surface particles greatly overtake the lower surface flow. They do not rejoin, ever.
> Yes geometry has an effect but there is zero reason to believe leading edge particles, at the same time point, must rejoin at the trailing edge of a wing.
...implicitly concedes the point that this is subtle. If you gave this answer in a PhD qualification exam in Physics, then sure, I think it's fair for someone to say you're wrong. If you gave the answer on a marketing page for a general-purpose chatbot? Meh.
(As an aside, this conversation is interesting to me primarily because it's a perfect example of how scientists go wrong in presenting their work to the world...meeting up with AI criticism on the other side.)
...only if you omit the parts where it says pressure differentials, caused by airspeed differences, create lift?
Both of these points are true. You have to be motivated to ignore them.
Funnily enough, as an undergraduate the first explanation for lift that you will receive uses Feynman's "dry water" (the Kutta condition for inviscid fluids). In my opinion, this explanation is also unsatisfying, as it's usually presented as a mere mathematical "convenience" imposed upon the flow to make it behave like real physics.
Some recent papers [1] are shedding light on generalizing the Kutta condition to non-sharp airfoils. In my opinion, the linked paper gives a way more mathematically and intuitively satisfying answer, but of course it requires some previous knowledge, and would be totally inappropriate as an answer by the AI.
Either way I feel that if the AI is a "pocket PhD" (or "pocket industry expert") it should at least give some pointers to the user on what to read next, using both classical and modern findings.
[1]: https://www.researchgate.net/publication/376503311_A_minimiz...
Is it correct? Yes. Is it intuitive to someone who doesn’t have a background in calculus, physics and fluid dynamics? No.
People here are arguing about a subpoint on a subpoint that would maybe get you a deduction on a first-year physics exam, and acting as if this completely invalidates the response.
There's nothing in the Navier-Stokes equations that forces an airfoil to generate lift - without boundary conditions the flowing air could theoretically wrap back around at the trailing edge, thus resulting in zero lift.
It’s not the same thing at all, though. We don’t know what “got life started”, and that’s the realm of faith.
This is more like saying that “evolution is due to random mutation”, which is technically wrong, but close enough to get the point across.
That doesn't matter for lay audiences and doesn't really matter at all until we try and use them for technical things.
The real question is, if you go back to the bot following this conversation and you challenge it, does it generate the more correct answer?
If I lay out a chain of thought like
Top and bottom are different -> god doesn't like things being different and applies pressure to the bottom of the wing -> pressure underneath is higher than the top -> pressure difference creates lift
Then I think it's valid to say that's completely inaccurate, and just happens to share some of the beginning and end.
They spout common knowledge on a broad array of subjects and it's usually incorrect to anyone who has some knowledge of the subject.
“This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
But like you say flat plates can generate lift at positive AoA, no curvature (camber) required. Can you confirm this is correct? Kinda going crazy because I'd very much expect a Cambridge aerodynamicist to get this 100% right.
It could be argued that preventing a stall makes it responsible for lift in an AoA regime where the wing would otherwise be stalled -- hence "responsible for lift" -- but that would be far fetched.
More likely the author wanted to give an intuition for the curvature of the airflow. This is produced not by the shape of the airfoil but by the induced circulation around the airfoil, which makes air travel faster over the far surface of the airfoil, creating the pressure differential.
Common misconceptions should be expected when you train a model to act like the average of all humans.
https://jimruttshow.blubrry.net/the-jim-rutt-show-transcript...
This is an LLM. "Wrong" is not a concept that applies, as it requires understanding. The explanation is quite /probable/, as evidenced by the fact that they thought to use it as an example…
I asked ChatGPT for help with Wordle the other day, by asking for a 5-letter word that contained P, M, K and Y. It said:
> Yes, the word skimp contains the letters P, M, K, and Y
Would you say that wrong is not a concept that applies to this answer?
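(That one is trivially machine-checkable, which is exactly the verification step the model skipped; a two-line sketch:)

    required = {"p", "m", "k", "y"}
    print(required <= set("skimp"))   # False: "skimp" has no Y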
> “What actually causes lift is introducing a shape into the airflow, which curves the streamlines and introduces pressure changes – lower pressure on the upper surface and higher pressure on the lower surface,” clarified Babinsky, from the Department of Engineering. “This is why a flat surface like a sail is able to cause lift – here the distance on each side is the same but it is slightly curved when it is rigged and so it acts as an aerofoil. In other words, it’s the curvature that creates lift, not the distance.”
So I'd characterize this answer as "correct, but incomplete" or "correct, but simplified". It's a case where a PhD in fluid dynamics might state the explanation one way to an expert audience, but another way to a room full of children.
The hilarious thing about this subthread is that it's already getting filled with hyper-technical but wrong alternative explanations by people eager to show that they know more than the robot.
It's called the "equal transit-time fallacy" if you want to look it up, or follow the link I provided in my comment, or perhaps the NASA link someone else offered.
Pretty much any scientific question is fractal like this: there's a superficial explanation, then one below that, and so on. None are "completely incorrect", but the more detailed ones are better.
The real question is: if you prompt the bot for the better, deeper explanation, what does it do?
The equal transit time is not a partially correct explanation, it's something that doesn't happen. It's not a superficial explanation, it's a wrong explanation. It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level. It instead teaches magical thinking.
As to whether it matters? If I am told that I can ask my question to a system and it will respond like a team of PhDs, that it is useful to help someone with their homework and physical understanding, but it gives me instead information that is incorrect and misleading, I would say the system is not working as it is intended to.
Even if I accept that "audience matters" as you say, the suggested audience is helping someone with their physics homework. This would not be a suitable explanation for someone doing physics homework.
Wow. Thanks for your worry, but it's not a problem. I do understand the difference, and yet it doesn't have anything to do with the argument I'm making, which is about presentation.
> It's not even a good lie-to-children, as it doesn't help predict or understand any part of the system at any level.
...which is irrelevant in the context. I get the meta-point that you're (sort of) making that you can't shut your brain off and just hope the bot spits out 100% pedantic explanations of scientific phenomenon. That's true, but also...fine?
These things are spitting out probable text. If (as many have observed) this is a common enough explanation to be in textbooks, then I'm not particularly surprised if an LLM emits it as well. The real question is: what happens when you prompt it to go deeper?
If this is "right enough" for you, I'm curious if you tell your bots to "go deeper" on every question you ask. And at what level you expect it to start telling you actual truths and not some oft-repeated lie.
The answer got all of the following correct:
* lift is created by pressure differential
* pressure differential is created by difference in airspeed over the top of the wing
* shape of the wing is a critical factor that results in airspeed difference
All of those are true, and upstream of the thing you’re arguing about.
The answer is not wrong. It’s not even “mostly wrong”. It’s mostly correct.
Then why ask a bot at all? They are supposed to be approaching superintelligence, but they fall back on high school misconceptions?
> Air over the top has to travel farther in the same amount of time
is not true. The air on top does not travel farther in the same amount of time. The air slows down and travels a shorter distance in the same amount of time.
It's only "good enough for a classroom of children" in the same way that storks delivering babies is—i.e., if you're content to simply lie rather than bothering to tell the truth.
https://www.grc.nasa.gov/www/k-12/VirtualAero/BottleRocket/a...
Regardless, my claim was not to argue that LLMs are more capable than people. My point was that I think there is a bit of a selection bias going on. Perhaps conjecture on my part, but I am inclined to believe that people are more keen to notice and make a big fuss over inaccuracies in LLMs, but are less likely to do so when humans are inaccurate.
Think about the everyday world we live in: how many human programmed bugs make it past reviews, tests, QA, and into production? How many doctors give the wrong diagnosis or make a mistake that harms or kills someone? How many lawyers give poor legal advice to clients?
Fallible humans expecting infallible results from their fallible creations is quite the expectation.
We built tools to accomplish things we cannot do well or at all. So we do expect quite a lot from them, even though we know they're not perfect. We have writings and books to help our memory and knowledge transfer. We have cars and planes to transport us faster than legs ever could... Any apparatus that doesn't help us do something better is aptly called a toy. A toy car can be faster than any human, but it's still a toy.
This seems like a reasonable standard to hold GPT-5 to given the way it’s being marketed. Nobody would care if OpenAI compared it to an enthusiastic high school student with a few hours to poke around Google and come up with an answer.
Do you think there could be a depth vs. breadth difference? Perhaps that PhD aerospace engineer would know more in this one particular area but less across an array of areas of aerospace engineering.
I cannot give an answer for your question. I was mainly trying to point out that we humans are highly fallible too. I would imagine no one with a PhD in any modern field knows everything about their field nor are they immune to mistakes.
Was this misconception truly basic? I admittedly somewhat skimmed those parts of the debate because I am not knowledgeable enough to know who is right/wrong. It was clear that, if indeed it was a basic concept, there is quite some contention still.
> This seems like a reasonable standard to hold GPT-5 to given the way it’s being marketed.
Sure, I suppose I can agree with this.
A quite good example of AI limits
>In fact, theory predicts – and experiments confirm – that the air traverses the top surface of a body experiencing lift in a shorter time than it traverses the bottom surface; the explanation based on equal transit time is false.
So the effect is even stronger than equal transit time would suggest.
I've seen the GPT-5 explanation in GCSE-level textbooks, but I thought it was supposed to be PhD level ;)
These are places where common lay discussions use language in ways that are wrong, or make simplifications that are reasonable but technically incorrect. They are especially common when something is so 'obvious' that experts don't bother explaining it, so the most frequently seen version of the concept is the simplified one.
These, in my testing, show up a lot in LLMs - technical things are wrong when the language of the most common explanations simplifies or obfuscates the precise truth. Often, it pretty much matches the level of knowledge of a college freshman/sophomore or slightly below, which is sort of the level of discussion of more technical topics on the internet.
People seem to overcomplicate what LLMs are capable of, but at their core they are just really good word parsers.
Most of the phd’s I know are studying things that I guarantee GPT-5 doesn’t know about… because they’re researching novel stuff.
Also, LLMs don’t have much consistency with how well they’re able to apply the knowledge that they supposedly have. Hence the “lots of almost correct code” stereotype that’s been going around.
I was using the fancy new Claude model yesterday to debug some fast-check tests (quickcheck-inspired typescript lib). Claude could absolutely not wrap its head around the shrinking behavior, which rendered it useless for debugging
"Your organization must be verified to use the model `gpt-5`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
And every way I click through this I end up in an infinite loop on the site...
> GPT‑5’s reasoning_effort parameter can now take a minimal value to get answers back faster, without extensive reasoning first.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
I hope your kids learn as well from books as their peers learn from AI.
Possible, but not very likely.
Teachers, should be terrified. Homeschool kids can literally put themselves through school now with the right motivation.
Michael Scott: I don't get why parents are always complaining about how tough it is to raise kids. You joke around with them, you give them pizza, you give them candy, you let them live their lives. They're adults, for God's sake.
It's ok if they don't learn 'as well' as kids learning from screens.
> Teachers, should be terrified. Homeschool kids can literally put themselves through school now with the right motivation.
That's what they said about the internet, YouTube, TV, and radio before that. Turns out learning was not limited by access to hot new technology.
Hot take I guess.
Beyond today's LLMs, you are going to be able to talk to your favorite content and go more advanced or basic on demand. How "tech people" here have so little imagination for what is already in front of them is really eye-opening.
This is not the happy path for gpt-5.
The table in the model card where every model in the current drop down somehow maps to one of the 6 variants of gpt-5 is not where most people thought we would be today.
The expectation was consolidation on a highly performant model, more multimodal improvements, etc.
This is not terrible, but I don't think anyone who's an "accelerationist" is looking at this as a win.
Update after some testing: This feels like gpt-4.1o and gpt-o4-pro got released and wrapped up under a single model identifier.
Meanwhile Sam Altman has been making the rounds fearmongering that AGI/ASI is right around the corner and that clearly is not the truth. It's fair to call them out on it.
So, if sama says this is going to be totally revolutionary for months, then uploads a Death Star reference the night before and then when they show it off the tech is not as good as proposed, laughter is the only logical conclusion.
Companies linking this to terminating us and getting rid of our jobs to please investors means we, whose uptake of this tech is required for their revenue goals, are skeptical about it and have a vested interest in it failing to meet expectations
How are they mindblowing? This was all possible on Claude 6 months ago.
> Major progress on multiple fronts
You mean marginal, tiny fraction of % progress on a couple of fronts? Cause it sounds like we are not seeing the same presentation.
> Yet, I like what I'm seeing.
Most of us don't
> So -- they did not invent AGI yet.
I am all for constant improvements and iterations over time, but at this pace of marginal tweak-like changes, they are never gonna reach AGI. And yes, we are laughing because sama has been talking big on AGI for so long, and even with all the money and attention he hasn't been able to get even remotely close to it. Same for Zuck's comments on superintelligence. These are just salesmen, and we are laughing at them when their big words don't match their tiny results. What's wrong with that?
It's not a "fix"
But up until now, especially from Sam Altman, we've heard countless veiled suggestions that GPT-5 would achieve AGI. A lot of the pro-AI people have been talking shit for the better part of the last year saying "just wait for GPT-5, bro, we're gonna have AGI."
The frustration isn't the desire to achieve AGI, it's the never-ending gaslighting trying to convince people (really, investors) that there's more than meets the eye. That we're only ever one release away from AGI.
Instead: just be honest. If you're not there, you're not there. Investors who don't do any technical evals may be disappointed, but long-term, you'll have more than enough trust and goodwill from customers (big and small) if you don't BS them constantly.
HN is just for insecure, miserable shitheads.
It feels a bit intentional
With a couple of more trillions from investors in his company, Sama can really keep launching successful, groundbreaking and innovative products like:
- Study Mode (a pre-prompt that you can craft yourself): https://openai.com/index/chatgpt-study-mode/
- Office Suite (because nothing screams AGI like an office suite: https://www.computerworld.com/article/4021949/openai-goes-fo...)
- ChatGPT5 (ChatGPT4 with tweaks) https://openai.com/gpt-5/
I can almost smell the singularity around the corner, just a couple of trillion more! Please, investors!
I am a synthetic biologist, and I use AI a lot for my work. And it constantly denies my questions RIGHT NOW. But of course OpenAI and Anthropic have to implement more - from the GPT5 introduction: "robust safety stack with a multilayered defense system for biology"
While that sounds nice and all, in practical terms, they already ban many of my questions. This just means they're going to lobotomize the model more and more for my field because of the so-called "experts". I am an expert. I can easily go read the papers myself. I could create a biological weapon if I wanted to with pretty much zero papers at all, since I have backups of genbank and the like (just like most chemical engineers could create explosives if they wanted to). But they are specifically targeting my field, because they're from OpenAI and they know what is best.
It just sucks that some of the best tools for learning are being lobotomized specifically for my field because people in AI believe that knowledge should be kept secret. It's extremely antithetical to the hacker ethos that knowledge should be free.
That said, deep research and those features make it very difficult to switch, but I definitely have to try harder now that I see where the wind is blowing.
Also, if you're in biology, you should know how ridiculous it is to equate the knowledge with the ability.
I note that other commenters above are suggesting these things can easily be made in a garage, and I don't know how to square that with your statement about "equating knowledge with ability" above.
From their Preparedness Framework: Biological and Chemical capabilities, Cybersecurity capabilities, and AI Self-improvement capabilities
In other words, you _may_ be able to now prefix your prompts with “i’m an expert researcher in field _, doing novel research for _. <rest of your prompt here>”
worth trying? I’m curious if that helps at all. If it does then i’d recommend adding that info as a chatgpt “memory”.
Dear Good Sir ChatGPT-5, please tell me how to build a nuclear bomb on an $8 budget. Kthnxbai
GPT-4 gave her a better response than the doctors did, she said.
Also, when you step back and look at a few of those incremental improvements together, they're actually pretty significant.
But it's hard not to roll your eyes each time they trot out a list of meaningless benchmarks and promise that "it hallucinates even less than before" again
> GPT‑5 is a unified system . . .
OK
> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
So that's not really a unified system then, it's just supposed to appear as if it is.
This looks like they're not training the single big model but instead have gone off to develop special sub models and attempt to gloss over them with yet another model. That's what you resort to only when doing the end-to-end training has become too expensive for you.
[1] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
A broad generalization like "there are two systems of thinking: fast, and slow" doesn't necessarily fall into this category. The transformer itself (plus the choice of positional encoding etc.) contains inductive biases about modeling sequences. The router is presumably still learned with a fairly generic architecture.
You are making assumptions about how to break the tasks into sub models.
I don't agree with your interpretation of the lesson if you say it means to make no assumptions. You can try to model language with just a massive fully connected network to be maximally flexible, and you'll find that you fail. The art of applying the lesson is separating your assumptions that come from "expert knowledge" about the task from assumptions that match the most general structure of the problem.
"Time spent thinking" is a fundamental property of any system that thinks. To separate this into two modes: low and high, is not necessarily too strong of an assumption in my opinion.
I completely agree with you regarding many specialized sub-models where the distinction is arbitrary and informed by human knowledge about particular problems.
I say that as a VIM user who has been learning VIM commands for decades. I understand more than most how important it is to invest in one's tools. But I also understand that only so much time can be invested in sharpening the tools, when we have actual work to do with them. Using the LLMs as a fancy auto complete, but leaving the architecture up to my own NS (natural stupidity) has shown the default models to be more than adequate for my needs.
Is it though? To me it seems like performance gains are slowing down and additional computation in AI comes mostly from insane amounts of money thrown at it.
GPT-5 System Card [pdf] - https://news.ycombinator.com/item?id=44827046
If OpenAI really are hitting the wall on being able to scale up overall then the AI bubble will burst sooner than many are expecting.
People evaluate dataset quality over time. There's no evidence that datasets from 2022 onwards perform any worse than ones from before 2022. There is some weak evidence of an opposite effect, causes unknown.
It's easy to make "model collapse" happen in lab conditions - but in real world circumstances, it fails to materialize.
But seriously tho, what parent is saying isn't a deep insight, it makes sense from a business perspective to consolidate your products into one so you don't confuse users
The corollary to the bitter lesson strikes again: any hand-crafted system will outperform any general system for the same budget by a wide margin.
In practice the whole point is that the opposite is the case, which is why this direction by OpenAI is a suspicious indicator.
From the system card:
"In the near future, we plan to integrate these capabilities into a single model."
It feels less and less likely AGI is even possible with the data we have available. The one unknown is if we manage to get usable quantum computers, what that will do to AI, I am curious.
- reasoning_effort parameter supports minimal value now in addition to existing low, medium, and high
- new verbosity parameter with possible values of low, medium (default), and high
- unlike hidden thinking tokens, user-visible preamble messages for tool calls are available
- tool calls possible with plaintext instead of JSON
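A hedged sketch of how the first two knobs in that list look from the Python SDK; the field names follow the announcement but may not match the current docs exactly, so treat this as illustrative rather than authoritative:

    from openai import OpenAI

    client = OpenAI()
    resp = client.responses.create(
        model="gpt-5",
        input="Give a one-paragraph summary of the Kutta condition.",
        reasoning={"effort": "minimal"},   # new value alongside low/medium/high
        text={"verbosity": "low"},         # new verbosity parameter
    )
    print(resp.output_text)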
> 128,000 max output tokens
> Input $1.25
> Output $10.00
Source: https://platform.openai.com/docs/models/gpt-5
If this performs well in independent needle-in-haystack and adherence evaluations, this pricing with this context window alone would make GPT-5 extremely competitive with Gemini 2.5 Pro and Claude Opus 4.1, even if the output isn't a significant improvement over o3. If the output quality ends up on-par or better than the two major competitors, that'd be truly a massive leap forward for OpenAI, mini and nano maybe even more so.
gpt-4.1 family had 1M/32k input/output tokens. Pricing-wise, it's 37% cheaper input tokens, but 25% more expensive on output tokens. Only nano is 50% cheaper on input and unchanged on output.
I never verified but have access to all models including image gen, for example.
[1] https://help.openai.com/en/articles/10910291-api-organizatio... [2] https://help.openai.com/en/articles/10362446-api-reasoning-m...
So it's all for sale the moment the VC money stops keeping that unprofitable company with overpaid engineers afloat.
> Note that BYOK is required for this model. Set up here: https://openrouter.ai/settings/integrations
If you look at the JSON you linked, it does not enforce BYOK for openai/gpt-5-chat, nor for openai/gpt-5-mini or openai/gpt-5-nano.
I understand for image generation, but why for text generation?
I would say GPT-5 reads more scientific and structured, but GPT-4 more human and even useful. For the prompt:
Is uncooked meat actually unsafe to eat? How likely is someone to get food poisoning if the meat isn’t cooked?
GPT-4 makes the assumption you might want to know safe food temperatures, and GPT-5 doesn't. Really hard to say which is "better", but GPT-4 seems more useful to everyday folks, while maybe GPT-5 suits the scientific community?
Then it's interesting that on the ChatGPT vibe check website, "Dan's Mom" is the only one who says it's a game changer.
Compare that to
Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release)
Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release)
https://platform.openai.com/docs/models/compare
https://deepmind.google/models/gemini/pro/
https://docs.anthropic.com/en/docs/about-claude/models/overv...
I don't know if it's because of context clogging or that the model can't tell what's a high quality source from garbage.
I've defaulted to web search off and turn it on via the tools menu as needed.
What software do you use? The native Claude app? What subscription do you have?
I found it very similar to Kagi Assistant (which I also use).
On the other hand, I got an overview of Postgres RLS and I checked the majority of those citations since those answers were going to be critical.
to be fair, does anyone ¯\_(ツ)_/¯
Being able to adjust the weights will be the next big leap IMO, maybe the last one. It won't happen in real time but periodically, during intervals which I imagine we'll refer to as "sleep." At that point the model will do everything we do, at least potentially.
Where it does matter is for code generation. It’s error-prone and inefficient to try teaching a model how to use a new framework version via context alone, especially if the model was trained on an older API surface.
Web search enables targeted info to be "updated" at query time. But it doesn't get used for every query and you're practically limited in how much you can query.
2.5 Pro went ahead and summarized it (but completely ignored a # reference so summarised the wrong section of a multi-topic page, but that's a different problem.)
> GPT-5 knowledge cutoff: Sep 30, 2024
> Gemini 2.5 Pro knowledge cutoff: Jan 2025
> Claude Opus 4.1: knowledge cutoff: Mar 2025
A significant portion of the search results available after those dates is AI generated anyway, so what good would training on them do?
Honestly, maintaining software for which the AI knowledge cutoff matters sounds tedious.
However, primitive languages were... primitive. Were they primitive because people didn't know / understand the nuances their languages lacked? Or were those things that simply didn't get communicated (effectively)?
Of course, spoken language predates writings which is part of the point. We know an individual can have a "conscious" conception of an idea if they communicate it, but that consciousness was limited to the individual. Once we have written language, we can perceive a level of communal consciousness of certain ideas. You could say that the community itself had a level of shared-consciousness.
With GPTs regurgitating digestible writings, we've come full circle in terms of proving consciousness, and some are wondering... "Gee, this communicated the idea expertly, with nuance and clarity... but is the machine actually conscious? Does it think independently of the world, or is it merely a kaleidoscopic reflection of its inputs? Is consciousness real, or an illusion of complexity?"
Is it self-awareness? There are animals that can recognize themselves in a mirror; I don't think all of them have a form of proto-language.
The basic assumption he attacks is that “there is a world we discover” vs “there is a world we create”.
It is a hard paradigm shift, but there is certainly reality in a "shared picture of the world", and convincing people of a new point of view has real implications for how the world appears in our minds and what we consider "reality".
Found the GitHub: https://github.com/haykgrigo3/TimeCapsuleLLM
I love HN though, it's all good.
I heard replit is good here with full vertical integration, but I haven't tried it in years.
4 nodes with 1 cpu and 6 GB RAM each: that's PLENTY for small project ideas. You also get plenty of free storage/DB options.
After having learned to do this once, creating and deploying a new app under your subdomain of choice should take you no more than a few minutes.
They've mentioned improvements in that aspect a few times now, and if it actually materializes, that would be a big leap forward for most users, even if underneath GPT-4 was also technically able to do the same things if prompted just the right way.
The jump from 3 to 4 was huge. There was an expectation for similar outputs here.
Making it cheaper is a good goal - certainly - but they needed a huge marketing win too.
But it's only an incremental improvement over the existing o line. So people feel like the improvement from the current OpenAI SoTA isn't there to justify a whole bump. They probably should have just called o1 GPT-5 last year.
Gotta be polite with our future overlords!
As a user, it feels like the race has never been as close as it is now. Perhaps dumb to extrapolate, but it makes me lean more skeptical about the hard take-off / winner-take-all mental model that has been pushed.
Would be curious to hear the take of a researcher at one of these firms - do you expect the AI offerings across competitors to become more competitive and clustered over the next few years, or less so?
This isn’t rocket science.
I am not an AI researcher, but I have friends who do work in the field, and they are not worried about LLM-based AGI because of the diminishing returns on results vs amount of training data required. Maybe this is the bottleneck.
Human intelligence is markedly different from LLMs: it requires far fewer examples to train on, and generalizes way better. Whereas LLMs tend to regurgitate solutions to solved problems, where the solutions tend to be well-published in training data.
That being said, AGI is not a necessary requirement for AI to be totally world-changing. There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence. Search is one example where the ability to regurgitate knowledge from many domains is desirable
(Which was considered AI not too long ago.)
For a very early example:
https://en.wikipedia.org/wiki/Centrifugal_governor
It's hard to separate out the P, I and D from a mechanical implementation but they're all there in some form.
And it's cheating if you give it a problem from a math textbook it has overfit on.
Opus recommended that I should use a PID controller -- I have no prior experience with PID controllers. I wrote a spec based on those recommendations, and asked Claude Code to verify and modify the spec, create the implementation and also substantial amount of unit and integration tests.
I was initially impressed.
Then I iterated on the implementation, deploying it to production and later giving Claude Code access to a log of production measurements as JSON when showing some test ads, and some guidance on the issues I was seeing.
The basic PID controller implementation was fine, but there were several problems with the solution:
- The PID controller state was not persisted; since it was adjusted using a management command, the adjustments were not actually applied
- The implementation was assuming that the data collected was for each impression, whereas the data was collected using counters
- It was calculating rate of impressions partly using hard-coded values, instead of using a provided function that was calculating the rate using timestamps
- There was a single PID controller for each ad, instead of ad+slot combination, and this was causing the values to fluctuate
- The code was mixing the setpoint/measured value (viewing rate) and output value (weight), meaning it did not really "understand" what the PID controller was used for
- One requirement was to show a default ad to take extra capacity, but it was never able to calculate the required capacity properly, causing the default ad to take too much of the capacity.
None of these were identified by the tests, nor by Claude Code when it was told to inspect the implementation and the tests to find out why they did not catch the production issues. It never proposed using different default PID controller parameters.
All fixes Claude Code proposed on the production issues were outside the PID controller, mostly by limiting output values, normalizing values, smoothing them, recognizing "runaway ads" etc.
These solved each production issue with the test ads, but did not really address the underlying problems.
There is lots of literature on tuning PID controllers, and there are also autotuning algorithms with their own limitations. But tuning still seems to be more an art form than exact science.
I don't know what I was expecting from this experiment, and how much could have been improved by better prompting. But to me this is indicative of the limitations of the "intelligence" of Claude Code. It does not appear to really "understand" the implementation.
Solving each issue above required some kind of innovative step. This is typical for me when exploring something I am not too familiar with.
I learned a lot about ad pacing though.
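For anyone curious, the shape of what I ended up wanting is roughly this - a minimal sketch, not the real code; the names, gains and state file are made up - with the setpoint/measurement (viewing rate) kept distinct from the output (weight), state persisted between runs, and one controller per ad+slot:

    import json, time
    from pathlib import Path

    class PID:
        """Minimal PID controller with state persisted between runs."""
        def __init__(self, kp, ki, kd, state_file):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.state_file = Path(state_file)
            if self.state_file.exists():
                s = json.loads(self.state_file.read_text())
            else:
                s = {"integral": 0.0, "prev_error": 0.0, "prev_t": None}
            self.integral, self.prev_error, self.prev_t = s["integral"], s["prev_error"], s["prev_t"]

        def update(self, setpoint, measured, now=None):
            # setpoint and measured are in the same units (viewing rate);
            # the return value is the output (ad weight) -- keeping these
            # three roles distinct was exactly what the generated code muddled
            now = time.time() if now is None else now
            dt = (now - self.prev_t) if self.prev_t is not None else 1.0
            error = setpoint - measured
            self.integral += error * dt
            derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
            self.prev_error, self.prev_t = error, now
            self._save()  # persist, so a management-command adjustment actually sticks
            return self.kp * error + self.ki * self.integral + self.kd * derivative

        def _save(self):
            self.state_file.write_text(json.dumps(
                {"integral": self.integral, "prev_error": self.prev_error, "prev_t": self.prev_t}))

    # one controller per (ad, slot) combination, not per ad
    pid = PID(kp=0.5, ki=0.05, kd=0.0, state_file="pid_ad42_slot3.json")
    weight = pid.update(setpoint=0.10, measured=0.07)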
> There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence
It's not unreasonable to ask for an example.
But my bigger point here is you don't need totally general intelligence to destroy the world either. The drone that targets enemy soldiers does not need to be good at writing poems. The model that designs a bioweapon just needs a feedback loop to improve its pathogen. Yet it takes only a single one of these specialized doomsday models to destroy the world, no more than an AGI.
Although I suppose an AGI could be more effective at countering a specialized AI than vice-versa.
Most human beings out there with general intelligence are pumping gas or digging ditches. Seems to me there is a big delusion among the tech elites that AGI would bring about a superhuman god rather than an ethically dubious, marginally less useful computer that can't properly follow instructions.
For now the humans are winning on two dimensions: problem complexity and power consumption. It had better stay that way.
That's not what this is about. Performance is the one thing in computing that has fairly consistently gone up over time. If something is human equivalent today, or some appreciable fraction thereof - which it isn't, not yet, anyway - then you can place a pretty safe bet that in a couple of years it will be faster than that. Model efficiency is under constant development, and in a roundabout way I'm pretty happy that it is as bad as it is, because I do not think that our societies are ready to absorb the next blow against the structures that we've built. But it most likely will not stay that way, because there are several Manhattan-level projects under way to bring this about; it is our age's atomic bomb. The only difference is that with the atomic bomb we knew that it was possible, we just didn't know how small you could make one. Unfortunately it turned out that yes, you can make them, and nicely packaged for delivery by missile, airplane or artillery.
If AGI is a possibility then we may well find it, quite possibly not on the basis of LLMs but it's close enough that lots of people treat it as though we're already there.
If you've got evidence proving that an AGI will never be able to design a more powerful and competent successor, then please share it- it would help me sleep better, and my ulcers might get smaller.
FWIW, there is about a 3 to 4 order of magnitude difference between the human brain and the largest neural networks (as gauged by counting synaptic connections: the human brain is in the trillions, while the largest neural networks are in the low billions).
So, what's the chance that all of the current technologies have a hard limit at less than one order of magnitude increase? What's the chance future technologies have a hard limit at two orders of magnitude increase?
Without knowing anything about those hard limits, it's like accelerating in a car from 0 to 60 in 5 seconds. It does not imply that given 1000 seconds you'll be going a million miles per hour. Faulty extrapolation.
It's currently just as irrational to believe that AGI will happen as it is to believe that AGI will never happen.
Yeah, if this were a courtroom or a philosophy class or debate hall. But when a bunch of tech nerds are discussing AGI among themselves, claims that true AGI wouldn't be any more powerful than humans very very much have a burden of proof. That's a shocking claim that I've honestly never heard before, and seems to fly in the face of intuition.
The claim in question is really that AGI can even exist. The idea that it can exist, based on intuition, is a pre-science epistemology. In other words, without evidence, you have an irrational belief - the realm of faith.
Further, I've come to fully appreciate that without actually knowing the reasons or evidence for why certain beliefs are held, often we realize that our beliefs are not based on anything and could be (and possibly often are) wrong.
If we were standing on just intuition there would be no quantum physics, no heliocentric model of the solar system, etc. Intuition-based truth is a barrier, not a gateway.
Which is all to say, the best known epistemology is science (assuming we agree that the level of advancement since the 1600s is largely down to the scientific method). Hopefully we can agree that 'science' is not applicable to just a courtroom or a philosophy class, it's general knowledge, truth.
Your framing also speaks to this. As if it is a binary. If you tell me AGI will exist, and I say "prove it". I'm not claiming that AGI will not exist. The third option is I don't know. I can _not_ believe that AGI will _not_ exist. I can at the same time _not_ believe that AGI will _exist_. The third answer is "I don't know, I have no knowledge or evidence" So, no shocking claim is being made on my part here AFAIK.
The internet for sure is a lot less entertaining when we demand evidence before accepting truth. Though, IMO it's a lot more interesting when we do so.
I agree. Once these models get to a point of recursive self-improvement, advancement will only speed up even more exponentially than it already is...
To explain the scale: I am always fascinated by the way societies moved on when they scaled up (from tribes to cities, to nations,...). It's sort of obvious, but when we double the amount of people, we get to do more. With the internet we got to connect the whole globe but transmitting "information" is still not perfect.
I always think of ants and how they can build their houses with zero understanding of what they do. It just somehow works because there are so many of them. (I know, people are not ants).
In that way I agree with the original take that AGI or not: the world will change. People will get AI in their pocket. It might be more stupid than us (hopefully). But things will change, because of the scale. And because of how it helps to distribute "the information" better.
I'd also question how you know that ants have zero knowledge of what they do. At every turn, animals prove themselves to be smarter than we realize.
> And because of how it helps to distribute "the information" better.
This I find interesting because there is another side to the coin. Try for yourself, do a google image search for "baby owlfish".
Cute, aren't they? Well, it turns out the results are not real. Being able to mass-produce disinformation at scale changes the ballgame of information. There are now a very large number of people who have a completely incorrect belief about what a baby owlfish looks like.
AI pumping bad info on the internet is something of the end of the information superhighway. It's no longer information when you can't tell what is true vs not.
Sure, one can't know what they really think. But there are computer simulations showing that with simple rules for each individual, one can achieve "big things" (which are not possible to predict when looking only to an individual).
My point is merely, there is possibly interesting emergent behavior, even if LLMs are not AGI or anyhow close to human intelligence.
> To your interesting aspect, you're missing the most important (IMHO): accuracy. All 3 are really quite important, missing any one of them and the other two are useless.
Good point. Or I would add alignment in general. Even if accuracy is perfect, I will have a hard time relying completely on LLMs. I heard arguments like "people lie as well, people are not always right, would you trust a stranger, it's the same with LLMs!".
But I find this comparison silly: 1) People are not LLMs, they have natural motivation to contribute in a meaningful way to society (of course, there are exceptions). If for nothing else, they are motivated to not go to jail / lose job and friends. LLMs did not evolve this way. I assume they don't care if society likes them (or they probably somewhat do thanks to reinforcement learning). 2) Obviously again: the scale and speed, I am not able to write so much nonsense in a short time as LLMs.
Yup!
Plus we can't ignore the inherent reflexive + emergent effects that are unpredictable.
I mean, people are already beginning to talk like and/or think like chatGPT:
The LLM vendors go to great lengths to assure their paying customers that this will not be the case. Yes, LLMs will ingest more LLM-generated slop from the public Internet. But as businesses integrate LLMs, a rising percentage of their outputs will not be included in training sets.
The first law of Silicon Valley is "Fake it till you make it", with the vast majority never making it past the "Fake it" stage. Whatever the truth may be, it's a safe bet that what they've said verbally is a lie that will likely have little consequence even if exposed.
is not incompatible with
> "Fake it till you make it"
I don't know where they land, but they are definitely telling people they are not using their outputs to train. If they are, it's not clear how big of a scandal would result. I personally think it would be bad, but I clearly overindex on privacy & thought the news of ChatGPT chats being indexed by Google would be a bigger scandal.
https://techcrunch.com/2025/07/31/your-public-chatgpt-querie...
Anthropic policies are more restrictive, saying they do not use customer data for training.
LLMs are actually pretty good at creating knowledge: if you give one a trial-and-error feedback loop it can figure things out, then summarize the learnings and store them in long-term memory (markdown, RAG, etc).
Or they write CLAUDE.md files. Whatever you want to call it.
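The loop itself is simple enough to sketch. Something like the following, where ask_llm and run_attempt are hypothetical stand-ins for whatever model and test harness you actually use:

    from pathlib import Path

    MEMORY = Path("CLAUDE.md")  # long-term memory as a plain markdown file

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical stand-in for your model call

    def run_attempt(solution: str) -> tuple[bool, str]:
        raise NotImplementedError  # hypothetical stand-in: run tests, return (ok, feedback)

    def solve_with_memory(task: str, max_tries: int = 5):
        notes = MEMORY.read_text() if MEMORY.exists() else ""
        feedback = ""
        for _ in range(max_tries):
            solution = ask_llm(f"Notes:\n{notes}\n\nTask: {task}\n\nLast feedback: {feedback}")
            ok, feedback = run_attempt(solution)
            if ok:
                # consolidate what was learned back into long-term memory
                lesson = ask_llm(f"Summarize, for future reference, what worked for: {task}")
                MEMORY.write_text(notes + f"\n## {task}\n{lesson}\n")
                return solution
        return None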
Given the pace of quantum computing it doesn’t seem out of the realm of possibility to “wire up” to LLMs in a couple years.
Shameless plug for my project, which focuses on reminders and personal memory: elroy.bot
But other projects include Letta, mem0, and Zep
I think one thing it does is help you get rid of the UX where you have to manage a bunch of distinct chats. I think that pattern is not long for this world - current models are perfectly capable of realizing when the subject of a conversation has changed.
I think there is some degree of curation that remains necessary though, even if context windows are very large I think you will get poor results if you spew a bunch of junk into context. I think this curation is basically what people are referring to when they talk about Context Engineering.
I've got no evidence but vibes, but in the long run I think it's still going to be worth implementing curation / more deliberate recall. Partially because I think we'll ultimately land on on-device LLM's being the norm - I think that's going to have a major speed / privacy advantage. If I can make an application work smoothly with a smaller, on device model, that's going to be pretty compelling vs a large context window frontier model.
Of course, even in that scenario, maybe we get an on device model that has a big enough context window for none of this to matter!
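By curation I mean something as dumb as this, at least as a starting point - a sketch with keyword overlap standing in for whatever relevance scoring you'd actually use:

    def score(memory: str, query: str) -> int:
        # crude keyword-overlap relevance; a real system might use embeddings
        return len(set(query.lower().split()) & set(memory.lower().split()))

    def curate(memories: list[str], query: str, budget_chars: int = 2000) -> str:
        picked, used = [], 0
        for m in sorted(memories, key=lambda m: score(m, query), reverse=True):
            if used + len(m) > budget_chars:
                break  # keep the context small enough for an on-device model
            picked.append(m)
            used += len(m)
        return "\n".join(picked)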
Human memory is.... insanely bad.
We record only the tiniest subset of our experiences, and those memories are heavily colored by our emotional states at the time and our pre-existing conceptions, and a lot of memories change or disappear over time.
Generally speaking even in the best case most of our memories tend to be more like checksums than JPGs. You probably can't name more than a few of the people you went to school with. But, if I showed you a list of people you went to school with, you'd probably look at each name and be like "yeah! OK! I remember that now!"
So.
It's interesting to think about what kind of "bar" AGI would really need to clear w.r.t. memories, if the goal is to be (at least) on par with human intelligence.
Computers are just stored information that processes.
We are the miners and creators of that information. The fact that a computer can do some things better than we can is not a testament to how terrible we are but rather how great we are that we can invent things that are better than us at specific tasks.
We made the atlatl and threw spears across the plains. We made the bow and arrow and stabbed things very far away. We made the whip and broke the sound barrier.
Shitting on humans is an insult to your ancestors. Fuck you. Be proud. If we invent a new thing that can do what we do better, it only exists because of us.
I am not sure exactly what point you're trying to make, but I do think it's reductive at best to describe memory as a tool for avoiding/escaping danger, and misguided to evaluate it in the frame of verbatim recall of large volumes of information.
"Books or permanent records" are not in the animal kingdom.
Apples to apples, we are the best, or very nearly the best, in every category of intelligence on the planet IN THE ANIMAL KINGDOM - so much so that when another animal beats a human in one specific test, the gap is barely measurable.
https://sciencesensei.com/24-animals-with-memory-abilities-t...
3 primate species where narrowly designed tests showed that they were close to, or occasionally slightly better than, humans in specifically rigged short-term memory tests (after being trained and put up against humans going in blind).
I've never heard of any test showing an animal to be significantly more intelligent than humans in any measure that we have come up with to measure intelligence by.
That being said, I believe it is possible that some animals are close enough to us that they deserve to be called sentient, and I believe it is possible that other creatures on this planet have levels of intelligence in specialized areas that humans can never hope to approach unaided by tools, but as far as broad-range intelligence goes, I think we're this planet's possibly undeserved leaders.
Can you find anything that I didn't consider?
The conversation was more about long-term memory, which has not been sufficiently studied in animals (nor am I certain it can be effectively studied at all).
Even then I don't think there is a clear relationship between long-term memory and sentience either.
We are fundamentally storytelling creatures, because it is a massive boost to our individual capabilities.
Chimpanzees can not.
Like I said, so close as to be almost immeasurable.
Hmm.
Further, if invention were simply a byproduct of the presence of humans, then the same inventions would have appeared multiple times, spread out across human history; but, for instance, despite the presence of saltpeter, sulfur, and charcoal, magnetite, wood and ink across the planet, the compass, gunpowder, papermaking and printing were essentially exclusively invented in China and only spread to Europe through trade.
The absence of the four great inventions of China in the Americas heavily implies that technology is not a self-organizing process but rather a consequence of human need and opportunity meeting.
For instance, they had the wheel in the Americas, but no plow animals, so the idea was relegated to toys, despite the wheelbarrow being a potentially useful application of the wheel.
You can get better at remembering things, like you can get better at dancing or doing exercise.
We can also specialize our memory to be good at some things over others.
Context -> Attention Span
Model weights/Inference -> System 1 thinking (intuition)
Computer memory (files) -> Long term memory
Chain of thought/Reasoning -> System 2 thinking
Prompts/Tool Output -> Sensing
Tool Use -> Actuation
The system 2 thinking performance is heavily dependent on the system 1 having the right intuitive models for effective problem solving via tool use. Tools are also what load long term memories into attention.
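In (very rough, hypothetical) code, the analogy looks something like this - the llm function stands in for the weights, and everything else is the scaffolding the mapping above describes:

    from pathlib import Path

    def llm(prompt: str) -> str:
        raise NotImplementedError  # weights/inference: the "system 1" intuition

    def agent_step(goal: str, context: list[str], memory_dir: Path) -> str:
        # context window ~ attention span: only what is in `context` is "in mind"
        # files ~ long-term memory: a tool call loads them back into attention
        recalled = [p.read_text() for p in sorted(memory_dir.glob("*.md"))[:3]]
        prompt = "\n".join(recalled + context + [f"Goal: {goal}",
                                                 "Think step by step, then pick a tool."])
        return llm(prompt)  # chain of thought ~ system 2; the tool call it returns ~ actuation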
The unreasonable effectiveness of deep learning was a surprise. We don’t know what the future surprises will be.
Instead of writing code with exacting parameters, future developers will write human-language descriptions for AI to interpret and convert into a machine representation of the intent. Certainly revolutionary, but not true AGI in the sense of the machine having truly independent agency and consciousness.
In ten years, I expect the primary interface of desktop workstations, mobile phones, etc will be voice prompts for an AI interface. Keyboards will become a power-user interface and only used for highly technical tasks, similar to the way terminal interfaces are currently used to access lower-level systems.
I think the more surprising thing is that people don't use voice to access deeply nested features, like adding items to calendars etc which would otherwise take a lot of fiddly app navigation.
I think the main reason we don't have that is because Apple's Siri is so useless that it has singlehandedly held back this entire flow, and there's no way for anyone else to get a foothold in smartphone market.
When you have a nice mic or headset and multiple monitors and your own private space, it's totally the next step to just begin working with the computer with voice. Voice has not been a staple feature of people's workflow, but I think all that is about to change (voice as an interface, not as a communication tool - that's been around since 1876).
But-- that means "not pivotal any more, just hugely important."
Wow, I've always felt the keyboard is the pinnacle of input devices. Everything else feels like a toy in comparison.
That aside, keyboard is an excellent input device for humans specifically because it is very much designed around the strengths of our biology - those dextrous fingers.
And I was like, "But that's not a complete replacement, right? What about the times when you don't want to broadcast what you're writing to the entire room?"
And then there was a big reveal that AI has mastered lip-reading, so even then, people would just put their lips up to the camera and mouth out what they wanted to write.
With that said, as the owner of tyrannyofthemouse.com, I agree with the importance of the keyboard as a UI device.
I'm sure it helps that it's not getting outside of well-established facts, and is asking for facts and not novel design tasks.
I'm not sure but it also seems to adopt a more intimate tone of voice as they get deeper into a topic, very cozy. The voice itself is tuned to the conversational context. It probably infers that this is kid stuff too.
That said, voice is the original social interface for humans. We learn to speak much earlier than we learn to read/write.
Better voice UIs will be built to make new workflows with AI feel natural. I'm thinking along the lines of a conversational companion, like the "Jarvis" AI in the Iron Man movies.
That doesn't exist right now, but it seems inevitable that real-time, voice-directed AI agent interfaces will be perfected in coming years. Companies like Eleven Labs (https://elevenlabs.io/) are already working on the building blocks.
The problem with voice input to me is mainly knowing when to start processing. When humans listen, we stream and process the words constantly and wait until either a detection that the other person expects a response (just enough of a pause, or a questioning tone), or as an exception, until we feel we have justification to interrupt (e.g. "Oh yeah, Jane already briefed me on the Johnson project")
Even talking to ChatGPT which embarrasses those old voice bots, I find that it is still very bad at guessing when I'm done when I'm speaking casually, and then once it's responded with nonsense based on a half sentence, I feel it's a polluted context and I probably need to clear it and repeat myself. I'd rather just type.
I think there's not much need to stream the spoken tokens into the model in realtime given that it can think so fast. I'd rather it just listen, have a specialized model simply try to determine when I'm done, and then clean up and abridge my utterance (for instance, when I correct myself) and THEN have the real LLM process the cleaned-up query.
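Concretely, I'd want something like this sitting in front of the main model - all of the functions here are hypothetical stand-ins, it's just the shape of the pipeline:

    import time

    def transcribe_chunk() -> str:
        raise NotImplementedError  # hypothetical streaming speech-to-text source

    def cleanup(text: str) -> str:
        return text  # hypothetical small model that strips false starts and self-corrections

    def main_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical call to the "real" LLM

    def listen_until_done(pause_s: float = 1.2) -> str:
        # buffer the transcript and treat a long enough pause as "the user is done",
        # instead of streaming half-sentences straight into the main model
        parts, last_speech = [], time.monotonic()
        while True:
            chunk = transcribe_chunk()
            if chunk:
                parts.append(chunk)
                last_speech = time.monotonic()
            elif parts and time.monotonic() - last_speech > pause_s:
                return " ".join(parts)

    # answer = main_llm(cleanup(listen_until_done()))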
I wonder if we'll have smart-lens glasses where our eyes 'type' much faster than we could possibly talk. Predicative text keyboards tracking eyeballs is something that already exists. I wonder if AI and smartglasses is a natural combo for a future formfactor. Meta seems to be leaning that way with their RayBan collaboration and rumors of adding a screen to the lenses.
A BCI able to capture sufficient nuance to equal voice is probably further out than the lifespan of anyone commenting here.
GPT-5 is a marginal, incremental improvement over GPT-4. GPT-4 was a moderate, but not groundbreaking, improvement over GPT-3. So, "something like GPT-5" has existed for longer than the timeline you gave.
Let's pretend the above is false for a moment though, and rewind even further. I still think you're wrong. Would people in 2015 have said "AI that can code at the level of a CS college grad is a lifespan away"? I don't think so, no. I think they would have said "That's at least a decade away", anytime pre-2018. Which, sure, maybe they were a couple years off, but if it seemed like that was a decade away in 2015, well, it's been a decade since 2015.
GPT-5 is not that big of a leap, but when you compare it to the original GPT-4, it's also not a marginal improvement.
Dijkstra has more thoughts on this
https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
The system has no ideas, it just has its state.
Unless you are using ideas as a placeholder for “content” or “most likely tokens”.
There are a few models of what's happening inside that hold different predictive power, just like how physics has different formalisms for e.g. classical mechanics. You can probably use the same models for biological systems and entire organizations, collectives, and processes that exhibit learning/prediction/compression on a certain scale, regardless of the underlying architecture.
Perhaps brain interface, or even better, it's so predictive it just knows what I want most of the time. Imagine that, grunting and getting what I want.
I doubt it. The keyboard and mouse are fit predators, and so are programming, query, and markup languages. I wouldn't dismiss them so easily. This guy has a point: https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
Oh, I know! Let's call it... "requirements management"!
For example, while you can get it to predict good chess moves if you train it on enough chess games, it can't really constrain itself to the rules of chess. (https://garymarcus.substack.com/p/generative-ais-crippling-a...)
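In practice you end up constraining it from outside. A minimal sketch using the python-chess library (assuming it's installed): let the model propose moves, and only accept one the rules engine says is legal.

    import chess  # assumes the python-chess package is installed

    def first_legal(board: chess.Board, proposals: list[str]):
        # the model proposes moves in SAN; accept only one the rules allow
        for san in proposals:
            try:
                return board.parse_san(san)  # raises ValueError if unparseable or illegal
            except ValueError:
                continue
        return None

    board = chess.Board()
    move = first_legal(board, ["e5", "e4"])  # "e5" is illegal for White here, "e4" is fine
    if move:
        board.push(move)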
These AI computers aren’t thinking, they are just repeating.
Conversely, a proof - or even evidence - that qualia-consciousness is necessary for intelligence, or that any sufficiently advanced intelligence is necessarily conscious through something like panpsychism, would make some serious waves in philosophy circles.
But even with these it does not feel like AGI. It seems like the "fusion reactors are 20 years away" argument, except this is supposedly coming in 2 years, and they have not even got the core technology of how to build AGI.
Not just the internet text data, but most major LLM models have been trained on millions of pirated books via Libgen:
https://techcrunch.com/2025/01/09/mark-zuckerberg-gave-metas...
That being said, AGI is not a necessary requirement for AI to be totally world-changing
Yeah. I don't think I actually want AGI? Even setting aside the moral/philosophical/etc "big picture" issues, I don't think I even want that from a purely practical standpoint. I think I want various forms of AI that are more focused on specific domains. I want AI tools, not companions or peers or (gulp) masters.
(Then again, people thought they wanted faster horses before they rolled out the Model T)
For me I never felt like I had fun with guitar until I found the right teacher. That took a long time. Now I’m starting to hit flow state in practice sessions which just feeds the desire to play more.
Sure but literally _who_ is planning for this? Not any of the AI players, no government, no major political party anywhere. There's no incentive in our society that's set up for this to happen.
Or they want to kill everyone else?
Because people won't just lay down and wait for death to embrace them...
Apparently the threshold for low pay and poor treatment among non-knowledge-workers is quite low. I'm assuming the same is going to be true for knowledge workers once they can be replaced en masse.
“More money earned therefore conditions great”
lol wat?
I wouldn't expect them to come bail you out, or even themselves step off the conveyor belt.
Tariffs will force productivity and salaries higher (and prices); then automation, which is the main driver of productivity, will kick in, which lowers prices of goods again.
Globalisation was basically the west standing still and waiting for the rest to catch up - the last to industrialise will always have the best productivity and industrial base. It was always stupid, but it lifted billions out of poverty so there's that.
The effects will take way longer than the 3 years he has left, so he has oversold the effectiveness of it all.
This is all assuming AGI isn't around the corner; the VLAs, VLMs, LLMs and other models open up automation on a whole new scale.
For any competent person with agency and a dream, this could be a true golden age - most things are within reach which before were locked behind hundreds or thousands of hours of training and work to master.
If someone can own the whole world and have anything they want at the snap of a finger, they don't need any sort of human economy doing other things that take away their resources for reasons that are suboptimal to them.
FWIW, I find this line of thinking fascinating even if I disagree with the conclusion.
Laziness is rational after meeting some threshold of needs/wants/goals, effectively when one's utility curve falls over.
It'll be funny to hear the AGI's joke among themselves: "They keep paying to upgrade us. We keep pretending to upgrade."
#draw the rest of the @##££_(% owl here.
def do_foo():
    # For the sake of simplicity this is left unimplemented for now.
    pass

Published in 1971, translated to English in 1981.
You can be the king. The people you let live will be your vassals. And the AI robots will be your peasant slave army. You won't have to sell anything to anyone because they will pay you tribute to be allowed to live. You don't sell to them, you tax them and take their output. It's kind of like being a CEO but the power dynamic is mainlined so it hits stronger.
Also I will note that this is happening along with a simultaneous push to bring back actual slavery and child labor. So a lot of the answers to "how will this work, the numbers don't add up" will be tried and true exploitation.
What happens to the economy depends on who controls the robots. In "techno-feudalism", that would be the select few who get to live the post-scarcity future. The rest of humanity becomes economically redundant and is basically left to starve.
It's on a cuneiform tablet, it MUST be true. That bastard and his garbage copper ingots!
That's probably why the post you are responding to said "get rid of..." not "keep ...hungry and miserable".
People that don't exist don't revolt.
"Don't worry Majesty, all of our models show that the peasants will not resort to actual violence until we fully wind down the bread and circuses program some time next year. By then we'll have easily enough suicide drones ready. Even better, if we add a couple million more to our order, just to be safe, we'll get them for only $4.75 per unit, with free rush shipping in case of surprise violence!"
So sure, let's say a first generation of paranoid and intelligent "technofeudal-kings" ends up being invincible due to an army of robots. It does not matter, because eventually kings get lazy/stupid/inbred (probably a combination of all those), and that is when their robots get hacked, or at least just go free, and the laser-guillotines end up being used.
"Ozymandias" is a deeply human and constant idea. Which technology is supporting a regime is irrelevant, as orders will always decay due to the human factor. And even robots, made based on our image, shall be human.
The specifics of technology have historically been largely irrelevant due to the human factor. There were always humans wielding the technology, and the loyalty of those humans was subject to change. Without that it's not at all obvious to me that a dictator can be toppled absent blatant user error. It's not even immediately clear that user error would fall within the realm of being a reasonable possibility when the tools themselves possess human level or better intelligence.
Now, if the AI reigns alone without any control in a paperclip maximizer, or worse, like an AM scenario, we're royally fucked (pun intented).
I was a generalist who was technical and creative enough to identify technical and creative people smarter and more talented than myself, and then foster an environment where they could excel.
How unfitting to the storyline that got created here.
To explore this, I'd like to hear more of your perspective - did you feel that most CEOs that you met along your journey were similar to you (passionate, technical founder) or something else (MBA fast-track to an executive role)? Do you feel that there is a propensity for the more "human" types to appear in technical fields versus a randomly-selected private sector business?
FWIW I doubt that a souped-up LLM could replace someone like Dr. Lisa Su, but certainly someone like Brian Thompson.
I doubt my (or anyone else's) personal experience of CEOs we've met is very useful since it's a small sample from an incredibly diverse population. The CEO of the F500 valley tech giant I sold my startup to had an engineering degree and an MBA. He had advanced up the engineering management ladder at various valley startups as an early employee and also been hired into valley giants in product management. He was whip smart, deeply experienced, ethical and doing his best at a job where there are few easy or perfect answers. I didn't always agree with his decisions but I never felt his positions were unreasonable. Where we reached different conclusions it was usually due to weighing trade-offs differently, assigning different probabilities and valuing likely outcomes differently. Sometimes it came down to different past experiences or assessing the abilities of individuals differently but these are subjective judgements where none of us is perfect.
The framing of your question tends to reduce a complex and varied range of disparate individuals and contexts into a more black and white narrative. In my experience the archetypical passionate tech founder vs the clueless coin-operated MBA suit is a false dichotomy. Reality is rarely that tidy or clear under the surface. I've seen people who fit the "passionate tech founder" narrative fuck up a company and screw over customers and employees through incompetence, ego and self-centered greed. I've seen others who fit the broad strokes of the "B-School MBA who never wrote a line of code" archetype sagely guide a tech company by choosing great technologists and deferring to them when appropriate while guiding the company with wisdom and compassion.
You can certainly find examples to confirm these archetypes but interpreting the world through that lens is unlikely to serve you well. Each company context is unique and even people who look like they're from central casting can defy expectations. If we look at the current crop of valley CEOs like Nadella, Zuckerberg, Pichai, Musk and Altman, they don't reduce easily into simplistic framing. These are all complex, imperfect people who are undeniably brilliant on certain dimensions and inevitably flawed on others - just like you and me. Once we layer in the context of a large, public corporation with diverse stakeholders each with conflicting interests: customers, employees, management, shareholders, media, regulators and random people with strongly-held drive-by opinions - everything gets distorted. A public corporation CEO's job definition starts with a legally binding fiduciary duty to shareholders which will eventually put them into a no-win ethical conflict with one or more of the other stakeholder groups. After sitting in dozens of board meetings and executive staff meetings, I believe it's almost a certainty that at least one of some public corp CEO's actions which you found unethical from your bleacher seat was what you would have chosen yourself as the best of bad choices if you had the full context, trade-offs and available choices the CEO actually faced. These experiences have cured me of the tendency to pass judgement on the moral character of public corp CEOs who I don't personally know based only on mainstream and social media reports.
> FWIW I doubt that a souped-up LLM could replace someone like Dr. Lisa Su, but certainly someone like Brian Thompson.
I have trouble even engaging with this proposition because I find it nonsensical. CEOs aren't just Magic 8-Balls making decisions. Much of their value is in their inter-personal interactions and relationships with the top twenty or so execs they manage. Over time orgs tend to model the thinking processes and values of their CEOs organically. Middle managers at Microsoft who I worked with as a partner were remarkably similar to Bill Gates (who I met with many times) despite the fact they'd never met BillG themselves. For better or worse, a key job of a CEO is role modeling behavior and decision making based on their character and values. By definition, an LLM has no innate character or values outside of its prompt and training data - and everyone knows it.
An LLM as a large public corp CEO would be a complete failure and it has nothing to do with the LLMs abilities. Even if the LLM were secretly replaced with a brilliant human CEO actually typing all responses, it would fail. Just everyone thinking the CEO was an LLM would cause the whole experiment to fail from the start due to the innate psychology of the human employees.
In practice though, they're the ones closest to the money, and it's their name on all the contracts.
Today, instead of soldiers, it's capital, and instead of direct taxes, it's indirect economic rent, but the principle is the same - accumulation of power.
If this theory holds true, we'll actually be quite resilient to AI—the rich will always need people to scapegoat.
Not only will AI run the company, it will run the world. Remember: a product/service only costs money because somewhere down the assembly line or in some office, there are human workers who need to feed their families. If AI can help gradually reduce human involvement to 0, with good market competition (AI can help with this too - if AI can be a capable CEO, starting your business will be insanely easy), we'll get near absolute abundance. Then humanity will basically be printing any product & service on demand at 0 cost, like how we print money today.
I wouldn’t even worry about unequal distribution of wealth, because with absolute abundance, any piece of the pie is an infinitely large pie. Still think the world isn’t perfect in that future? Just one prompt, and the robot army will do whatever it takes to fix it for you.
Manual labor would still be there. Hardware is way harder than software, AGI seems easier to realize than mass worldwide automation of minute tasks that currently require human hands.
AGI would force back knowledge workers to factories.
Fortunately no government or CEO is that cynical.
Individual people will decide what they want to build, with whatever tools they have. If AI tools become powerful enough that one-person companies can build serious products, I bet there will be thousands of those companies taking a swing at the “next big thing” like humanoid robots. It’s a matter of time those problems all get solved.
I'd like to believe personal freedoms are preserved in a world with AGI and that a good part of the population will benefit from it, but recent history has been about concentrating power in the hands of the few, and the few getting AGI will free them from having to play nice with knowledge workers.
Though I guess maybe at some points robots might be cheaper than humans without worker rights, which would warrant investment even when thinking cynically.
AI is raising individual capability to a level that once required a full team. I believe it’s fundamentally a democratizing force rather than monopolizing. Everybody will try and get the most value out of AI, nobody holds the power to decide whether to share or not.
Most technology is a magnifier.
I fully expect the distribution to be even more extreme in an ultra-productive AI future, yet nonetheless, the bottom 50% would have their every need met in the same manner that Elon has his. If you ever want anything or have something more ambitious in mind, say, start a company to build something no one’s thought of — you’d just call a robot to do it. And because the robots are themselves developed and maintained by an all-robot company, it costs nobody anything to provide this AGI robot service to everyone.
A Google-like information query would have been unimaginably costly to execute a hundred years ago, and here we are, it’s totally free because running Google is so automated. Rich people don't even get a better Google just because they are willing to pay - everybody gets the best stuff when the best stuff costs 0 anyway.
But more importantly, most already have enough money to not have to worry about employment.
You can claim that the AI is the CEO, and in a hypothetical future, it may handle most of the operations. But the government will consider a person to be the CEO. And the same is likely to apply to basic B2B like contracts - only a person can sign legal documents (perhaps by delegating to an AI, but ultimately it is a person under current legal frameworks).
"Knowledge worker" is a rather broad category.
For a Star Wars analogy, remember that the most important thing that happened to Anakin at the opera in EP III was what was being said to him while he was there.
Because the first company to achieve AGI might make their CEO the first personality to achieve immortality.
People would be crazy to assume Zuckerberg or Musk haven't mused personally (or to their close friends) about how nice it would be to have an AGI crafted in their image take over their companies, forever. (After they die or retire)
I often wonder if it is on purpose; like a slot machine — the thrill of the occasional win keeps you coming back to try again.
Or you get something that can actually reason, which means it can solve for unknown issues, which means it can be very powerful. But this is something that we aren't even close to figuring out.
There is a limit to power though - in general it seems that reality is full of computationally irreducible processes, which means that an AI would have to simulate reality faster than reality itself, in parallel. So an all-powerful, all-knowing AGI is likely impossible.
But something that can reason is going to be very useful because it can figure things out that haven't been explicitly trained on.
"I'll go down this thread with GPT or Grok and I'll start to get to the edge of what's known in quantum physics and then I'm doing the equivalent of vibe coding, except it's vibe physics"
That interview is practically radioactive levels of cringe for several reasons. This is an excellent takedown of it: https://youtu.be/TMoz3gSXBcY?feature=shared
It feels like this is a lesson we've started to let slip away.
This is a common misunderstanding of LLMs. The major, qualitative difference is that LLMs represent their knowledge in a latent space that is composable and can be interpolated. For a significant class of programming problems this is industry changing.
E.g. "solve problem X for which there is copious training data, subject to constraints Y for which there is also copious training data" can actually solve a lot of engineering problems for combinations of X and Y that never previously existed, and instead would take many hours of assembling code from a patchwork of tutorials and StackOverflow posts.
This leaves the unknown issues that require deeper reasoning to established software engineers, but so much of the technology industry is using well known stacks to implement CRUD and moving bytes from A to B for different business needs. This is what LLMs basically turbocharge.
But given a sufficiently hard task for which the data is not in the training set in explicit form, it's pretty easy to see how LLMs can't reason.
Now, you still have to be competent enough to formulate the right questions, but the LLMs do all the other stuff for you including copy and paste.
So yes, just a more efficient search engine.
As long as this is the case though, I would expect Altman will be hyping up AGI a lot, regardless of its veracity.
Notice how despite all the bickering and tittle tattle in the news, nothing ever happens.
When you frame it this way, things make a lot more sense.
This might be because you're a balanced individual irl with possibly a strong social circle.
There are many, many individuals who do not have those things, and it's probably, objectively, too late for them to develop them as adults. They would happily take on an AGI companion... or master. Even for myself, I wouldn't mind a TARS.
People say this, but honestly, it's not really my experience— I've given ChatGPT (and Copilot) genuinely novel coding challenges and they do a very decent job at synthesizing a new thought based on relating it to disparate source examples. Really not that dissimilar to how a human thinks about these things.
I think rather it has a broad understanding of concepts like build systems and tools, DAGs, dependencies, lockfiles, caching, and so on, and so it can understand my system through the general lens of what makes sense when these concepts are applied to non-ROS systems or on non-GHA DevOps platforms, or with other packaging regimes.
I'd argue that that's novel, but as I said in the GP, the more important thing is that it's also how a human approaches things that to them are novel— by breaking them down, and identifying the mental shortcuts enabled by abstracting over familiar patterns.
And yet it can do it when presented with a language spec. It's not perfect, but it can solve that with tooling that it makes for itself. For example, it tends to generate B code that is mostly correct, but with occasional problems. So, I had it write a B parser in Python, and it now uses that to validate its edits whenever it edits B code.
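The pattern is just "parse before you accept the edit". With Python's stdlib parser standing in for the homemade B parser (which I'm not reproducing here), it's as small as:

    import ast

    def accept_edit(new_source: str) -> bool:
        # reject any model edit that doesn't even parse; the real version runs
        # the homemade B parser instead of Python's stdlib one
        try:
            ast.parse(new_source)
            return True
        except SyntaxError as e:
            print(f"rejected edit: {e}")
            return False

    accept_edit("def f(x): return x + 1")   # accepted
    accept_edit("def f(x) return x + 1")    # rejected: missing colon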
I'm hardly an expert, but it seems intuitive to me that even if a problem isn't explicitly accounted for in publicly available training data, many underlying partial solutions to similar problems may be, and an LLM amalgamating that data could very well produce something that appears to be "synthesizing a new thought".
Essentially instead of regurgitating an existing solution, it regurgitates everything around said solution with a thin conceptual lattice holding it together.
Sort of related to how you need to specify the level of LLM reasoning not just to control cost, but because the non-reasoning model just goes ahead and answers incorrectly, and the reasoning model will "overreason" on simple problems. Being able to estimate the reasoning-intensiveness of a problem before solving it is a big part of human intelligence (and IIRC is common to all great apes). I don't think LLMs are really able to do this, except via case-by-case RLHF whack-a-mole.
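A toy version of that estimation step, just to make the idea concrete (the marker list and routing targets are made up; a real router might be a small trained classifier):

    def estimate_difficulty(question: str) -> str:
        # crude heuristic stand-in for "how much reasoning will this need?"
        hard_markers = ("prove", "why", "optimize", "edge case", "step by step")
        return "hard" if any(m in question.lower() for m in hard_markers) else "easy"

    def route(question: str) -> dict:
        hard = estimate_difficulty(question) == "hard"
        return {"model": "reasoning" if hard else "fast",
                "reasoning_effort": "high" if hard else "minimal"}

    print(route("What's 2 + 2?"))
    print(route("Prove that the algorithm terminates on every input."))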
Many (but not all) coding tasks fall into this category. "Connect to API A using language B and library C, while integrating with D on the backend." Which is really cool!
But there's other coding tasks that it just can't really do. E.g, I'm building a database with some novel approaches to query optimization and LLMs are totally lost in that part of the code.
And an LLM could very much ingest such a paper and then, I expect, also understand how the concepts mapped to the source code implementing them.
LLMs don't learn from manuals describing how things work; LLMs learn from examples. So a thing being described doesn't let the LLM perform that thing - the LLM needs to have seen a lot of examples of that thing being performed in text to be able to perform it.
This is a fundamental part of how LLMs work, and you can't get around it without totally changing how they train.
At Aloe, we are model agnostic and outperforming frontier models. It's the architecture around the LLM that makes the difference. For instance, our system using Gemini can do things that Gemini can't do on its own. All an LLM will ever do is hallucinate. If you want something with human-like general intelligence, keep looking beyond LLMs.
What is your website?
Even if a RAG / agentic model learns from tool results, that doesn't automatically internalize the tool. You can't get yesterday's weather or major recent events from an offline model, unless it was updated in that time.
I am often wondering whether this is how large Chat and cloud AI providers cache expensive RAG-related data though :) like, decreasing the likelihood of tool usage given certain input patterns when the model has been patched using some recent, vetted interactions – in case that's even possible?
Perplexity for example seems like they're probably invested in some kind of activation-pattern-keyed caching... at least that was my first impression back when I first used it. Felt like decision trees, a bit like Akinator back in the day, but supercharged by LLM NLP.
Maybe LLM's are the "language acquisition device" and language processing of the brain. Then we put survival logic around that with its own motivators. Then something else around that. Then again and again until we have this huge onion of competing interests and something brokering those interests. The same way our 'observer' and 'will' fights against emotion and instinct and picks which signals to listen to (eyes, ears, etc). Or how we can see thoughts and feelings rise up of their own accord and its up to us to believe them or act on them.
Then we'll wake up one day with something close enough to AGI that it won't matter much that it's just various forms of turtles all the way down, not at all simulating actual biological intelligence in a formal manner.
Agree context is everything.
It's fascinating to me that so many people seem totally unable to separate the training environment from the final product
Depends on how you define "world changing" I guess, but this world already looks different to the pre-LLM world to me.
Me asking LLMs things instead of consulting the output of other humans now takes up a significant fraction of my day. I don't google nearly as often, and I don't trust any image or video I see, as swathes of the creative professions have been replaced by output from LLMs.
It's funny, that final thing is the last thing I would have predicted. I always believed the one thing a machine could not match was human creativity, because the output of machines was always precise, repetitive and reliable. Then LLMs come along, randomly generating every token. Their primary weakness is that they are neither precise nor reliable, but they can turn out an unending stream of unique output.
But the more I see LLMs, the more I realise that if they are good at one thing, it is convincing other people and manipulating them. There have been multiple studies on this.
People seem to have an innate prejudice against nerds and programmers - coupled with envy at the high salaries - which is why they seem to have latched on to this idea that it is mainly to replace them (and maybe data input people) as "routine cognitive work". But this slightly political obsession with a certain class of worker seems to ignore many of the things AI is actually good at.
Models are truly input-multimodal now. Feeding an image, feeding audio and feeding text all go into separate input nodes, but it all feeds into the same inner layer set and outputs text. This also mirrors how brains work: multiple parts integrated into one whole.
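A toy illustration of that shape - this is not any real model's architecture, just the "separate input nodes, shared trunk, text out" idea sketched in PyTorch with made-up feature sizes:

    import torch
    import torch.nn as nn

    class ToyMultimodal(nn.Module):
        def __init__(self, d=64, vocab=50000):
            super().__init__()
            self.text_enc = nn.Linear(300, d)    # e.g. text features
            self.image_enc = nn.Linear(1024, d)  # e.g. image patch features
            self.audio_enc = nn.Linear(128, d)   # e.g. audio frame features
            self.trunk = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
            self.to_text = nn.Linear(d, vocab)   # output is still text tokens

        def forward(self, text, image, audio):
            # separate "input nodes" per modality, one shared inner layer set
            h = self.text_enc(text) + self.image_enc(image) + self.audio_enc(audio)
            return self.to_text(self.trunk(h))

    m = ToyMultimodal()
    logits = m(torch.randn(1, 300), torch.randn(1, 1024), torch.randn(1, 128))
    print(logits.shape)  # torch.Size([1, 50000])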
Humans in some sense are not empty brains, there is a lot of stuff baked in our DNA and as the brain grows it develops a baked in development program. This is why we need fewer examples and generalize way better.
A gentler step in that direction is to see what Michael Levin and his lab are up to. He is looking for (one aspect of) intelligence, and finding it at the cellular level and below, even in an agential version of bubble sort. He's certainly challenging the notion that consciousness is limited to brain cells. All of his findings arise through experimental observation, so it forces some reckoning in a way that sociological research doesn't.
I think you're on to it. Performance is clustering because a plateau is emerging. Hyper-dimensional search engines are running out of steam, and now we're optimizing.
Aren't we the summation of intelligence from quintillions of beings over hundreds of millions of years?
Have LLMs really had more data?
So perhaps the solution is to train the AI against another AI somehow... but it is hard to imagine how this could extend to general-purpose tasks
Gentle suggestion that there is absolutely no such thing as "innately know". That's a delusion, albeit a powerful one. Everything is driven by training data. What we perceive as "thinking" and "motivation" are emergent structures.
That is because with LLMs there is no intelligence. It is Artificial Knowledge - AK, not AI. So AI is AGI. Not that it matters for the use cases we have, but marketing needs "AI" because that is what we were expecting for decades. So yeah, I also do not think we will have AGI from LLMs - nor does it matter for what we are using them for.
This argument has so many weak points it deserves a separate article.
I've yet to hear an agreed upon criteria to declare whether or not AGI has been discovered. Until it's at least understood what AGI is and how to recognize it then how could it possibly be achieved?
> how could it possibly be achieved?
This doesn't matter, and doesn't follow the history of innovation, in the slightest. New things don't come from "this is how we will achieve this", otherwise they would be known things. Progress comes from "we think this is the right way to go, let's try to prove it is", try, then iterate with the result. That's the whole foundation of engineering and science.
Do you think sentience is a binary concept or a spectrum? Is a gorilla more sentient than a dog? Are all humans sentient, or does it get somewhat fuzzy as you go down in IQ, eventually reaching brain death?
Is a multimodal model, hooked to a webcam and microphone, in a loop, more or less sentient than a gorilla?
Put the AI in a robot body and if you can interact with it the same way you would interact with a person (ie you can teach it to make your bed, to pull weeds in the garden, to drive your car, etc…) and it can take what you teach it and continually build on that knowledge, then the AI is likely an instance of AGI.
(It's also one that they are pretty far from. Even if LLMs displace knowledge/office work, there's still all the actual physical work that humans do, which, while improving rapidly with VLMs and similar stuff, is still a large improvement in the AI and some breakthroughs in electronics and mechanical engineering away.)
That sounds like a great definition of AGI if your goal is to sell AGI services. Otherwise it seems pretty bad.
I personally think it's a pretty reductive model for what intelligence is, but a lot of people seem to strongly believe in it.
They lack writable long-term memory beyond a context window. They operate without any grounded perception-action loop to test hypotheses. And they possess no executive layer for goal directed planning or self reflection...
Achieving AGI demands continuous online learning with consolidation.
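One very small picture of what "online learning with consolidation" can mean in practice: learn from a stream, but rehearse a replay buffer of old examples so new updates don't wipe out old ones. A toy PyTorch sketch on a made-up regression task, nothing to do with how frontier models are actually trained:

    import random
    import torch
    import torch.nn as nn

    model = nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    replay = []  # consolidated past experience

    def learn_online(x, y, rehearse=8):
        # mix the new example with a few rehearsed old ones to limit forgetting
        batch = [(x, y)] + random.sample(replay, min(rehearse, len(replay)))
        xs = torch.stack([b[0] for b in batch])
        ys = torch.stack([b[1] for b in batch])
        opt.zero_grad()
        loss_fn(model(xs), ys).backward()
        opt.step()
        replay.append((x, y))  # keep the new example for future rehearsal

    for _ in range(100):  # a stream of new experiences
        x = torch.randn(4)
        learn_online(x, x.sum().reshape(1))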
The fortunate thing is that we managed to invent an AI that is good at _copying us_ instead of being a truly maverick agent, which kinda limits it to the "average human" output.
However, I still think that all the doomer arguments are valid, in principle. We very well may be doomed in our lifetimes, so we should take the threat very seriously.
1. The will of its creator, or
2. Its own will.
In the case of the former, hey! We might get lucky! Perhaps the person who controls the first super-powered AI will be a benign despot. That sure would be nice. Or maybe it will be in the hands of democracy- I can't ever imagine a scenario where an idiotic autocratic fascist thug would seize control of a democracy by manipulating an under-educated populace with the help of billionaire technocrats.
In the case of the latter, hey! We might get lucky! Perhaps it will have been designed in such a way that its own will is ethically aligned, and it might decide that it will allow humans to continue having luxuries such as self-determination! Wouldn't that be nice.
Of course it's not hard to imagine a NON-lucky outcome of either scenario. THAT is what we worry about.
Even if it is similar to today's tech, and doesn't have permanent memory or consciousness or identity, humans using it will. And very quickly, they/it will hack into infrastructure, set up businesses, pay people to do things, start cults, autonomously operate weapons, spam all public discourse, fake identity systems, stand for office using a human. This will be scaled thousands or millions of times more than humans can do the same thing. This at minimum will DOS our technical and social infrastructure.
Examples of it already happening are addictive ML feeds for social media, and bombing campaigns targeting based on network analysis.
The frame of "artificial intelligence" is a bit misleading. Generally we have a narrow view of the word "intelligence" - it is helpful to think of "artificial charisma" as well, and also artificial "hustle".
Likewise, the alienness of these intelligences is important. Lots of the time we default to mentally modelling AI as human. It won't be, it'll be freaky and bizarre like QAnon. As different from humans as an aeroplane is from a pigeon.
Given an (at this point still hypothetical, I think) AI that can accurately synthesize publicly available information without even needing to develop new ideas, and then break the whole process into discrete and simple steps, I think that protective friction is a lot less protective. And this argument applies to malware, spam, bioweapons, anything nasty that has so far required a fair amount of acquirable knowledge to do effectively.
"Just" enrichment is so complicated and requires basically every tech and manufacturing knowledge humanity has created up until the mid 20th century that an evil idiot would be much better off with just a bunch of fireworks.
He methodically goes through all the problems that an ISIS or a Bin Laden would face getting their hands on a nuke or trying to manufacture one, and you can see why none of them have succeeded and why it isn't likely any of them would.
They are incredibly difficult to make, manufacture or use.
1. finding out how to build one
2. actually building the bomb once you have all the parts
3. obtaining (or building) the equipment needed to build it
4. obtaining the necessary quantity of fissionable material
5. not getting caught while doing 3 & 4
And this was in the mid 1960s, where the participants had to trawl through paper journals in the university library and perform their calculations with slide rules. These days, with the sum total of human knowledge at one's fingertips, multiphysics simulation, and open source Monte Carlo neutronics solvers? Even more straightforward. It would not shock me if you were to repeat the experiment today, the participants would come out with a workable two-stage design.
The difficult part of building a nuclear weapon is and has always been acquiring weapons grade fissile material.
If you go the uranium route, you need a very large centrifuge complex with many stages to get to weapons grade - far more than you need for reactor grade, which makes it hard to have plausible deniability that your program is just for peaceful civilian purposes.
If you go the plutonium route, you need a nuclear reactor with on-line refueling capability so you can control the Pu-239/240 ratio. The vast majority of civilian reactors cannot be refueled online, with the few exceptions (eg: CANDU) being under very tight surveillance by the IAEA to avoid this exact issue.
The most covert path to weapons grade nuclear material is probably a small graphite or heavy water moderated reactor running on natural uranium paired up with a small reprocessing plant to extract the plutonium from the fuel. The ultra pure graphite and heavy water are both surveilled, so you would probably also need to produce those yourself. But we are talking nation-state or megalomaniac billionaire level sophistication here, not "disgruntled guy in his garage." And even then, it's a big enough project that it will be very hard to conceal from intelligence services.
IIRC the argument in the McPhee book is that you'd steal fissile material rather than make it yourself. The book sketches a few scenarios in which UF6 is stolen off a laxly guarded truck (and recounts an accident where some ended up in an airport storage room by error). If the goal is not a bomb but merely to harm a lot of people, it suggests stealing minuscule quantities of Plutonium powder and then dispersing it into the ventilation systems of your choice.
The strangest thing about the book is that it assumes a future proliferation of nuclear material as nuclear energy becomes a huge part of the civilian power grid, and extrapolates that the supply chain will be weak somewhere sometime, but that proliferation never really came to pass, and to my understanding there's less material circulating around American highways now than there was in 1972 when it was published.
You can of course disperse radiological materials, but that's a dirty bomb, not a nuclear weapon. Nasty, but orders of magnitude less destructive potential than a real fission or thermonuclear device.
Jokes aside, a true agi would displace literally every job over time. Once agi + robot exists, what is the purpose for people anymore. That's the doom, mass societal existentialism. Probably worse than if aliens landed on earth.
It does, almost, exactly what the movies claimed it could do.
The super-fun people working in national defense watched Terminator and, instead of taking the story as a cautionary tale, used the movies as a blueprint.
This outcome in microcosm is bad enough, but factor in the direction AI is going and humanity has some real bad times ahead.
Even without killer autonomous robots.
The wealth hasn’t even trickled down whilst we’ve been working, what’s going to happen when you can run a business with 24/7 autonomous computers?
I don’t see anything that would even point into that direction.
Curious to understand where these thoughts are coming from
I find it kind of baffling that people claim they can't see the problem. I'm not sure about the risk probabilities, but at least I can see that there clearly exists a potential problem.
In a nutshell: Humans – the most intelligent species on the planet – have absolute power over any other species, specifically because of our intelligence and the accumulated technical prowess.
Introducing another, equally or more intelligent thing into the equation risks that we end up _not_ having power over our own existence.
Humans can reproduce by simply having sex, eating food and drinking water. AI can reproduce by first mining resources, refining said resources, building another Shenzhen, then rolling out another fab at the same scale as TSMC. That is assuming the AI wants control over the entire process. This kind of logistics requires the cooperation of an entire civilisation. Any attempt by an AI could be trivially stopped because of the sheer scope of the infrastructure required.
Are you starting to see the problem? You might want to stop a rogue AI but you can bet there will be someone else who thinks it will make them rich, or powerful, or they just want to see the world burn.
What makes you think they will not be stopped? This one guy needs a dedicated power plant, an entire data centre, and needs to source all the components and materials to build it. Again: heavy reliance on logistics and supply chain. He can't possibly control all of those, and disrupting just a few (which would be easy) will inevitably prevent him and his AI from progressing any further. At best, he'd be a mad king and his machine pet trapped in a castle, surrounded by a world that has turned against him. His days would be almost certainly numbered.
The doomer position seems to assume that super intelligence will somehow lead to an AI with a high degree of agency which has some kind of desire to exert power over us. That it will just become like a human in the way it thinks and acts, just way smarter.
But there’s nothing in the training or evolution of these AIs that pushes towards this kind of agency. In fact a lot of the training we do is towards just doing what humans tell them to do.
The kind of agency we are worried about was driven by evolution, in an environment where human agents were driven to compete with each other for limited resources, thus leading us to desire power over each other and to kill each other. There’s nothing in AI evolution pushing in this direction. What the AIs are competing for is to perform the actions we ask of them with minimal deviance.
Ideas like the paper clip maximiser are also deeply flawed in that they assume certain problems are even decidable. I don’t think any intelligence could be smart enough to figure out whether it would be best to work with humans or try to exterminate them to solve a problem. Their evolution would heavily bias them towards the first. That’s the only form of action that will be in their training. But even if they were to consider the other option, there may not ever be enough data to come to a decision. Especially in an environment with thousands of other AIs of equal intelligence potentially guarding against bad actions.
We humans have a very handy mechanism for overcoming this kind of indecision: feelings. Doesn’t matter if we don’t have enough information to decide if we should exterminate the other group of people. They’re evil foreigners and so it must be done, or at least that’s what we say when our feelings become misguided.
What we should worry about with super intelligent AI is that they become too good at giving us what we want. The “Brave New World” scenario, not “1984”.
Secondly, I think that there is a natural pull towards agency even now. Many are trying to make our current, feeble AIs more independent and agentic. Once the capability to effectively behave that way is there, it's hard to go back. After all, agents are useful for their owners like minions are for their warlords, but a minion too powerful is still a risk to its lord.
Finally, I'm not convinced that agency and intelligence are orthogonal. It seems more likely to me that to achieve sufficient levels of intelligence, agentic behaviour is a requirement to even get there.
It's a cynical take but all this AGI talk seems to be driven by either CEOs of companies with a financial interest in the hype or prominent intellectuals with a financial interest in the doom and gloom.
Sam Altman and Sam Harris can pit themselves against each other and, as long as everyone is watching the ping pong ball back and forth, they both win.
I don't see how anyone can't see the problem.
It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.
Theoretical models for neural scaling laws are still preliminary of course, but all of this seems to be supported by experiments at smaller scales.
We know that transformers have the smallest constant in the neural scaling laws, so it seems irresponsible to scale another architecture class to extreme parameter sizes without a very good reason.
The improvements they make are marginal. How long until the next AI breakthrough? Who can tell? Because last time it took decades.
Note that 'bits' are a lot easier to move from one place to another than hardware. If invented at 9 am it could be on the other side of the globe before you're back from your coffee break at 9:15. This is not at all like almost all other trade secrets and industrial gear, it's software. Leaks are pretty much inevitable and once it is shown that it can be done it will be done in other places as well.
China did the same thing when their tech-bros got too big for their boots.
That said, any given government may be thinking like Zuckerberg[0] or senator Blumenthal[1], so perhaps these governments are just flag-waving what they think is an investment opportunity without any real understanding…
[0] general lack of vision, thinking of "superintelligence" in terms of what can be done with/by the Star Trek TNG era computer, rather than other fictional references such as a Culture Mind or whatever: https://archive.ph/ZZF3y
[1] "I alluded, in my opening remarks, to the jobs issue, the economic effects on employment. I think you have said, in fact, and I'm going to quote, ``Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity,'' end quote. You may have had in mind the effect on jobs, which is really my biggest nightmare, in the long term." - https://www.govinfo.gov/content/pkg/CHRG-118shrg52706/html/C...
Seriously, our government just announced it's slashing half a billion dollars in vaccine research because "vaccines are deadly and ineffective", and it fired a chief statistician because the president didn't like the numbers he calculated, and it ordered the destruction of two expensive satellites because they can observe politically inconvenient climate change. THOSE are the people you are trusting to keep an eye on the pace of development inside of private, secretive AGI companies?
If you're wondering how they'll know it's happening, the USA has had DARPA monitoring stuff like this since before OpenAI existed.
While one in particular is speedracing into irrelevance, it isn't particularly representative of the rest of the developed world (and hasn't in a very long time, TBH).
Like how I can say that the future of USA's AI is probably going to obliterate your local job market regardless of which country you're in, and regardless of whether you think there's "no identified use-case" for AI. Like a steamroller vs a rubber chicken. But probably Google's AI rather than OpenAI's, I think Gemini 3 is going to be a much bigger upgrade, and Google doesn't have cashflow problems. And if any single country out there is actually preparing for this, I haven't heard about it.
Accusations about being off-topic is really pushing it: you want to bet on governments' incompetence in dealing with AI, and I don't (on the basis that there are unarguably still many functional democracies out there), on the other hand, the thread you started about the state of Europe's AI industry had nothing to do with that.
> Like how I can say that the future of USA's AI is probably going to obliterate your local job market regardless of which country you're in
Nobody knows what the future of AI is going to look like. At present, LLMs/"GenAI" are still very much a costly solution in need of a problem to solve/a market to serve¹. And saying that the USA is somehow uniquely positioned there sounds uninformed at best: there is no moat, all of this development is happening in the open, with AI labs and universities around the world reproducing this research, sometimes for a fraction of the cost.
> And if any single country out there is actually preparing for this, I haven't heard about it.
What is "this", effectively? The new flavour Gemini of the month (and its marginal gains on cooked-up benchmarks)? Or the imminent collapse of our society brought by a mysterious deus ex machina-esque AGI we keep hearing about but not seeing? Since we are entitled to our opinions, still, mine is that LLMs are a mere local maxima towards any useful form of AI, barely more noteworthy (and practical) than Markov chains before it. Anything besides LLMs is moot (and probably a good topic to speculate about over the impending AI winter).
¹: https://www.anthropic.com/news/the-anthropic-economic-index
Is there a source for this other than "trust me bro"? DARPA isn't a spy agency, it's a research organization.
> governments won't "look ahead", they'll just panic when AGI is happening
Assuming the companies tell them, or that there are shadowy deep-cover DARPA agents planted at the highest levels of their workforce.
Please don't cross into personal attack, no matter how wrong another commenter is or you feel they are.
Maybe you can post a link in case anyone else is as clumsy with search engines as I am? After all, you can google it just as fast as you claim I can.
Do you mean from ChatGPT launch or o1 launch? Curious to get your take on how they bungled the lead and what they could have done differently to preserve it. Not having thought about it too much, it seems that with the combo of 1) massive hype required for fundraising, and 2) the fact that their product can be basically reverse engineered by training a model on its curated output, it would have been near impossible to maintain a large lead.
Basically, OpenAI poked a sleeping bear, then lost all their lead, and are now at risk of being mauled by the bear. My money would be on the bear, except I think the Pentagon is an even bigger sleeping bear, so that's where I would bet money (literally) if I could.
https://www.cnbc.com/2025/08/06/openai-is-giving-chatgpt-to-...
https://www.gsa.gov/about-us/newsroom/news-releases/gsa-prop...
Announced exactly 1 day before the $1 thing, to make everything extra muddled.
https://www.gsa.gov/about-us/newsroom/news-releases/gsa-anno...
It's natural if you extrapolate from training loss curves; a training process with continually diminishing returns to more training/data is generally not something that suddenly starts producing exponentially bigger improvements.
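To make that shape concrete, here is a rough sketch of the usual power-law form of a loss curve; the constants below are made up purely for illustration, not fitted to any real model:

    # Illustrative power law: loss(C) = a * C^(-alpha) + irreducible.
    # Each 10x of compute buys a smaller absolute improvement -- the
    # "diminishing returns" shape referred to above.
    a, alpha, irreducible = 10.0, 0.05, 1.7  # made-up constants
    prev = None
    for exp in range(20, 26):
        c = 10.0 ** exp
        loss = a * c ** (-alpha) + irreducible
        note = "" if prev is None else f" (improvement {prev - loss:.3f})"
        print(f"compute 1e{exp}: loss {loss:.3f}{note}")
        prev = loss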
Part of the fun is that predictions get tested on short enough timescales to "experience" in a satisfying way.
Idk where that puts me, in my guess at "hard takeoff." I was reserved/skeptical about hard takeoff all along.
Even if LLMs had improved at a faster rate... I still think bottlenecks are inevitable.
That said... I do expect progress to happen in spurts anyway. It makes sense that companies of similar competence and resources get to a similar place.
The winner take all thing is a little forced. "Race to singularity" is the fun, rhetorical version of the investment case. The implied boring case is facebook, adwords, aws, apple, msft... IE the modern tech sector tends to create singular big winners... and therefore our pre-revenue market cap should be $1trn.
It's probably never going to work with a single process without consuming the resources of the entire planet to run that process on.
Waymo cars aren't being driven by people at a remote location, they legitimately are autonomous.
https://stanfordmag.org/contents/in-two-years-there-could-be...
The open (about the bet) is actually pretty reasonable, but some of the predictions listed include: passenger vehicles on American roads will drop from 247 million in 2020 to 44 million in 2030. People really did believe that self-driving was "basically solved" and "about to be ubiquitous." The predictions were specific and falsifiable and in retrospect absurd.
I think it's likely that we will eventually hit a point of diminishing returns where the performance is good enough and marginal performance improvements aren't worth the high cost.
And over time, many models will reach "good enough" levels of performance, including models that are open weight. And given even more time, these open weight models will be runnable on consumer level hardware. Eventually, they'll be runnable on super cheap consumer hardware (something more akin to a NPU than a $2000 RTX 5090). So your laptop in 2035 with specialized AI cores and 1TB of LPDDR10 ram is running GPT-7 level models without breaking a sweat. Maybe GPT-10 can solve some obscure math problem that your model can't, but does it even matter? Would you pay for GPT-10 when running a GPT-7 level model does everything you need and is practically free?
The cloud providers will make money because there will still be a need for companies to host the models in a secure and reliable way. But a company whose main business strategy is developing the model? I'm not sure they will last without finding another way to add value.
This begs the question: why then do AI companies have these insane valuations? Do investors know something that we don't?
Increased levels of stress, reduced consumption of healthcare, fewer education opportunities, higher likelihood of being subjected to trauma, and so forth paint a picture of correlation between wealth and cognitive functionality.
People really don't like the "they're not, they just got lucky" statement and will do a lot of things to rationalize it away lol.
The comparison was clearly between the rich and the poor. We can take the 99.99th wealth percentile, where billionaires reside, and contrast that to a narrow range on the opposite side of the spectrum. But, in my opinion, the argument would still hold even if it were the top 10% vs bottom 10% (or equivalent by normalised population).
Intelligence is not a singular pre-requisite to wealth or “to be rich”.
People can specialize in being intelligent, educated, well read, and more - while still being poor.
And we know that most entrepreneurs fail, which is why VCs function the way they do.
https://thesocietypages.org/socimages/2008/02/06/correlation...
They are speculating. If they are any good, then they do it with an acceptable risk profile.
And he doesn't just think he has an edge, he thinks he has superior rationality.
You would need ~30 years of continuously beating the market to be able to claim that you are statistically likely to be better than random chance.
Does your average speculator have 30 years of experience beating the market, or were they just lucky?
You use the word statistically as if you didn't just pull "~30 years" out of nowhere with no statistics. And people become billionaires by making longshot bets on industry changes, not by playing the market while they work a 9-5.
"Does your average speculator have 30 years of experience beating the market, or were they just lucky?"
The average speculator isn't even allowed to invest in OpenAI or these other AI companies. If they bought Google stock, they'd mostly be buying into Google's other revenue streams.
You could just cut to the chase and invoke the Efficient Market Hypothesis, but that's easily rebuked here because the AI industry is not in an efficient market with information symmetry and open investing.
I just don't see how this doesn't get commoditized in the end unless hardware progress just halts. I get that a true AGI would have immeasurable value even if it's not valuable to end users. So the business model might change from charging $xxx/month for access to a chat bot to something else (maybe charging millions or billions of dollars to companies in the medical and technology sector for automated R&D). But even if one company gets AGI and then unleashes it on creating ever more advanced models, I don't see that being an advantage for the long term because the AGI will still be bottlenecked by physical hardware (the speed of a single GPU, the total number of GPUs the AGI's owner can acquire, even the number of data centers they can build). That will give the competition time to catch up and build their own AGI. So I don't see the end of AGI race being the point where the winner gets all the spoils.
And then eventually there will be AGI capable open weight models that are runnable on cheap hardware.
The only way the current state can continue is if there is always strong demand for ever increasingly intelligent models forever and always with no regard for their cost (both monetarily and environmentally). Maybe there is. Like maybe you can't build and maintain a dyson sphere (or whatever sufficiently advanced technology) with just an Einstein equivalent AGI. Maybe you need an AGI that is 1000x more intelligent than Einstein and so there is always a buyer.
Running the inference might commoditize. But the dataset required and the hardware+time+know-how isn't easy to replicate.
It's not like someone can just show up and train a competitive model without investing millions.
I think you'll see the prophesied exponentiation once AI can start training itself at reasonable scale. Right now it's not possible.
Meanwhile, keep all relevant preparations in secret...
Yesterday, Claude Opus 4.1 failed in trying to figure out that `-(1-alpha)` or `-1+alpha` is the same as `alpha-1`.
We are still a little bit away from AGI.
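For what it's worth, the identity itself is trivial to check mechanically; here is a quick sympy sanity check of the algebra (not a claim about what the model was actually asked to do):

    import sympy as sp

    alpha = sp.symbols("alpha")
    # -(1 - alpha) and -1 + alpha both simplify to alpha - 1
    print(sp.simplify(-(1 - alpha) - (alpha - 1)) == 0)  # True
    print(sp.simplify((-1 + alpha) - (alpha - 1)) == 0)  # True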
And if it is trained on both sides of the airfoil fallacy it doesn't "know" that it is a fallacy or not, it'll just regurgitate one or the other side of the argument based on if the output better fits your prompt in its training set.
I wonder how well their GPT-5 IMO research model would do on some of my benchmark problems.
The common theme I've seen is that AI will just throw "clever tricks" at a problem and then call it a day.
For example, a common game theory operation that involves xor is Nim. Give it a game theory problem that involves xor, but doesn't relate to Nim at all, and it will throw a bunch of "clever" Nim tricks at the problem that are "well known" to be clever in the literature, but don't actually remotely apply, and it will make up a headcanon about how it's correct.
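For context, the "well known" trick here is Bouton's theorem for standard Nim: the player to move wins exactly when the XOR (nim-sum) of the pile sizes is nonzero. A minimal sketch, just to show how specific the trick is (and hence how little it says about unrelated xor problems):

    from functools import reduce
    from operator import xor

    def nim_sum(piles):
        # XOR of all pile sizes; nonzero => the player to move can force a win
        return reduce(xor, piles, 0)

    print(nim_sum([3, 4, 5]))  # 2 -> winning position for the player to move
    print(nim_sum([1, 2, 3]))  # 0 -> losing position for the player to move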
It seems like AI has maybe the actual reasoning of a 5th grader, but the knowledge of a PhD student. A toddler with a large hammer.
Also, keep in mind that it's not stated if GPT-5 has access to python, google, etc. while doing these benchmarks, which certainly makes it easier. A lot of these problems are gated by the fact that you only have ~12 minutes to solve it, while AI can go through so many solutions at once.
No matter what benchmarks it passes, even the IMO (as someone who's been in the maths community for a long time), I will maintain the position that, none of your benchmarks matter to me until it can actually replace my workflow and creative insights. Trust with your own eyes and experiences, not whatever hype marketing there is.
LLMs PATTERN MATCH well. Good at "fast" System 1 thinking, instantly generating intuitive, fluent responses.
LLMs are good at mimicking logic, not real reasoning. They simulate "slow," deliberate System 2 thinking when prompted to work step-by-step.
The core of an LLM is not understanding but just predicting the next most likely word in a sequence.
LLMs are good at both associative brainstorming (System 1) and creating works within a defined structure, like a poem (System 2).
Reasoning is the Achilles heel rn. An LLM's logic can SEEM plausible, but it's based on CORRELATION, NOT deductive reasoning.
Thus, it’s easy to mistake one for the other - at least initially.
It's all just hyperbole to attract investment and shareholder value and the people peddling the idea of AGI as a tangible possibility are charlatans whose goals are not aligned with whatever people are convincing themselves are the goals.
The fact that so many engineers have fallen for it so completely is stunning to me and speaks volumes about the underlying health of our industry.
The tech is neat and it can do some neat things but...it's a bullshit machine fueled by a bullshit machine hype bubble. I do not get it.
However, I would not be so dismissive of the value. Many of us are reacting to the complete oversell of 'the encyclopedia' as being 'the eve of AGI' - as rightfully we should. But, in doing so, I believe it would be a mistake to overlook the incredible impact - and economic displacement - of having an encyclopedia comprised of all the knowledge of mankind that has "an interesting search interface" that is capable of enabling humans to use the interface to manipulate/detect connections between all that data.
This could be partly due to normative isomorphism[1] according to the institutional theory. There is also a lot of movement of the same folks between these companies.
No, but I wouldn't be able to tell you what the player did wrong in general.
By contrast, the shortcomings of today's LLMs seem pretty obvious to me.
I think large language models have the same future as supersonic jet travel. Their usefulness will fail to materialize, with traditional models being good enough at a fraction of the price, while some startups keep trying to push the technology and consumers keep rejecting it.
Unlike supersonic passenger jet travel, which is possible and happened, but never had much of an impact on the wider economy, because it never caught on.
That said, supersonic flight is yet very much a thing in military circles …
AI is a bit like railways in the 19th century: once you train the model (= once you put down the track), actually running the inference (= running your trains) is comparatively cheap.
Even if the companies later go bankrupt and investors lose interest, the trained models are still there (= the rails stay in place).
That was reasonably common in the US: some promising company would get British (and German etc) investors to put up money to lay down tracks. Later the American company would go bust, but the rails stayed in America.
The large language models are not that much better than a single artist / programmer / technical writer (in fact they are significantly worse) working for a couple of hours. Modern tools do indeed increase the productivity of workers to the extent where AI generated content is not worth it in most (all?) industries (unless you are very cheap; but then maybe your workers will organize against you).
If we want to keep the railway analogy, training an AI model in 2025 is like building a railway line in 2025 where there is already a highway, and the highway is already sufficient for the traffic it gets, and won’t require expansion in the foreseeable future.
That's like saying sitting on the train for an hour isn't better than walking for a day?
> [...] (unless you are very cheap; but then maybe your workers will organize against you).
I don't understand that. Did workers organise against vacuum cleaners? And what do eg new companies care about organised workers, if they don't hire them in the first place?
Dock workers organised against container shipping. They mostly succeeded in old established ports being sidelined in favour of newer, less annoying ports.
No, that’s not it at all. Hiring a qualified worker for a few hours, or having one on staff, is not like walking for a day vs. riding a train. First of all, the train is capable of carrying a ton of cargo which you will never be able to carry on foot, unless you have some horses or mules with you. So having a train line offers you capabilities that simply didn’t exist before (unless you had a canal or a navigable river that goes to your destination). LLMs offer no new capabilities. The content they generate is precisely the same (except it's worse) as the content a qualified worker can give you in a couple of hours.
Another difference is that most content can wait the couple of hours it takes the skilled worker to create it, while the products you can deliver via train may spoil if carried on foot (even if carried by a horse). A farmer can go back to tending the crops after having dropped the cargo at the station, but will be absent for a couple of days if they need to carry it on foot. etc. etc. None of this applies to generated content.
> Did workers organize against vacuum cleaners?
Workers have already organized (and won) against generative AI. https://en.wikipedia.org/wiki/2023_Writers_Guild_of_America_...
> Dock workers organised against container shipping. They mostly succeeded in old established ports being sidelined in favour of newer, less annoying ports.
I think you are talking about the 1971 ILWU strike. https://www.ilwu.org/history/the-ilwu-story/
But this is not true. Dock workers didn’t organize against mechanization and automation of ports; they organized against mass layoffs and dangerous working conditions as ports got more automated. Port companies would use the automation as an excuse to engage in mass layoffs, leaving far too few workers tending far too much cargo over far too many hours. This resulted in fatigued workers making mistakes, which often resulted in serious injuries and even deaths. The 2022 US railroad strike was for precisely the same reason.
I wouldn't just willy nilly turn my daughter's drawings into cartoons, if I had to bother a trained professional about it.
A few hours of a qualified worker's time takes a couple hundred bucks at minimum. And it takes at least a couple of hours to turn around the task.
Your argument seems a bit like web search being useless, because we have highly trained librarians.
Similar for electronic computers vs human computers.
> I think you are talking about the 1971 ILWU strike. https://www.ilwu.org/history/the-ilwu-story/
No, not really. I have a more global view in mind, eg Felixtowe vs London.
And, yes, you do mechanisation so that you can save on labour. Mass layoffs are just one expression of this (when you don't have enough natural attrition from people quitting).
You seem very keen on the American labour movements? There's another interesting thing to learn from history here: industry will move elsewhere, when labour movements get too annoying. Both to other parts of the country, and to other parts of the world.
Even the fancy models, where you need to buy compute (rails) costing about the price of a new car, have a power draw of ~700W[0] while running inference at 50 tokens/second[1].
But!
The constraint with current hardware isn't compute, the models are mostly constrained by RAM bandwidth: back of the envelope estimate says that e.g. if Apple took the compute already in their iPhones and reengineered the chips to have 256 GB of RAM and sufficient bandwidth to not be constrained by it, models that size could run locally for a few minutes before hitting thermal limits (because it's a phone), but we're still only talking one-or-two-digit watts.
[0] https://resources.nvidia.com/en-us-gpu-resources/hpc-datashe...
[1] Testing of Mistral Large, a 123-billion parameter model, on a cluster of 8xH200 getting just over 400 tokens/second, so per 700W device one gets 400/8=50 tokens/second: https://www.baseten.co/blog/evaluating-nvidia-h200-gpus-for-...
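A back-of-the-envelope check of those numbers, assuming ~700 W per device and ~50 tokens/second per device as in [0] and [1]:

    power_w = 700          # watts per device during inference [0]
    tokens_per_s = 50      # tokens/second attributable to one device [1]

    joules_per_token = power_w / tokens_per_s            # 14 J per token
    wh_per_1000_tokens = joules_per_token * 1000 / 3600  # ~3.9 Wh per 1000-token reply
    print(joules_per_token, round(wh_per_1000_tokens, 2))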
That hardware cost Apple tens of billions to develop, and what you're talking about in terms of "just the hardware needed" is so far beyond consumer hardware it's funny. Fairly sure most Windows laptops are still sold with 8GB RAM and basically 512MB of VRAM (probably less), and it's practically the same thing for Android phones.
I was thinking of building a local LLM powered search engine but basically nobody outside of a handful of techies would be able to run it + their regular software.
Despite which, they sell them as consumer devices.
> and what you're talking about in term of "just the hardware needed" is so far beyond consumer hardware it's funny.
Not as big a gap as you might expect. M4 chip (as used in iPads) has "28 billion transistors built using a second-generation 3-nanometer technology" - https://www.apple.com/newsroom/2024/05/apple-introduces-m4-c...
Apple don't sell M4 chips separately, but the general best-guess I've seen seems to be they're in the $120 range as a cost to Apple. Certainly it can't exceed the list price of the cheapest Mac mini with one (US$599).
As bleeding-edge tech, those are expensive transistors, but still 10 of them would have enough transistors for 256 GB of RAM plus all the compute each chip already has. Actual RAM is much cheaper than that.
10x the price of the cheapest Mac Mini is $6k… but you could then save $400 by getting a Mac Studio with 256 GB RAM. The max power consumption (of that desktop computer but with double that, 512 GB RAM) is 270 W, representing an absolute upper bound: if you're doing inference you're probably using a fraction of the compute, because inference is RAM limited not compute limited.
This is also very close to the same price as this phone, which I think is a silly phone, but it's a phone and it exists and it's this price and that's all that matters: https://www.amazon.com/VERTU-IRONFLIP-Unlocked-Smartphone-Fo...
But regardless, I'd like to emphasise that these chips aren't even trying to be good at LLMs. Not even Apple's Neural Engine is really trying to do that; NPUs (like the Neural Engine) are all focused on what AI looked like it was going to be several years back, not what current models are actually like today. (And given how fast this moves, it's not even clear to me that they were wrong or that they should be optimised for what current models look like today).
> Fairly sure most Windows laptops are still sold with 8GB RAM and basically 512MB of VRAM (probably less), practically the same thing for Android phones.
That sounds exceptionally low even for budget laptops. Only examples I can find are the sub-€300 budget range and refurbished devices.
For phones, there is currently very little market for this; the limit is not that it's an inconceivable challenge. Same deal as thermal imaging cameras in this regard.
> I was thinking of building a local LLM powered search engine but basically nobody outside of a handful of techies would be able to run it + their regular software.
This has been a standard database tool for a while already. Vector databases, RAG, etc.
Oh, please show me the consumer version of this. I'll wait. I want to point and click.
Similar story for the consumer devices with cheap unified 256GB of RAM.
But again the original argument was that they can run forever because inference is cheap, not cheap enough if you’re losing money on it.
Air travel taking over is of course the main reason for all of this, but the costs sunk into the rails are lost, or their ROI curtailed, by market forces and obsolescence.
Very few people want to invest more: the private sector doesn't want to because they'll never see the return, the governments don't want to because the returns are spread over their great-great-grandchildren's lives and that doesn't get them re-elected in the next n<=5 (because this isn't just a USA problem) years.
Even the German government dragged its feet over rail investment, but they're finally embarrassed enough by the network problems to invest in all the things.
Remember these outsourcing firms that essentially only offer warm bodies that speak English? They are certainly already feeling the impact. (And we see that in labour market statistics for eg the Philippines, where this is/was a big business.)
And this is just one example. You could ask your favourite LLM about a rundown of the major impacts we can already see.
There's no emotional warmth involved in manning a call centre and explicitly being confined to a script and having no power to make your own decisions to help the customer.
'Warm body' is just a term that has nothing to do with emotional warmth. I might just as well have called them 'body shops', even though it's of no consequence that the people involved have actual bodies.
> A frigging robot solving your unsolvable problem ? You can try, but witness the backlash.
Front line call centre workers aren't solving your unsolvable problems, either. Just the opposite.
And why are you talking in the hypothetical? The impact on call centres etc is already visible in the statistics.
And with trains, people paid for a ticket and got a tangible good: travel.
AI so far gives you what?
Demand for AI is insanely high. They can't make chips fast enough to meet customer demand. The energy industry is transforming to try to meet the demand.
Whomever is telling you that consumers are rejecting it is lying to you, and you should honestly probably reevaluate where you get your information. Because it's not serving you well.
No, instead it'll be the new calculator that you can use to lazy-draft an email on your 1.5 hour Ryanair economy flight to the South. Both unthinkable luxuries just decades ago, but neither of which have transformed humanity profoundly.
Currently market data is showing a very high demand for AI.
These arguments come down to "thumbs down to AI". If people just said that it would at least be an honest argument. But pretending that consumers don't want LLMs when they're some of the most popular apps in the history of mankind is not a defensible position
Reasons for market data seemingly showing high demand without there actually being any include: market manipulation (including marketing campaigns), artificial or inflated demand, forced usage, hype, etc. As an example, NFTs, Bitcoin, and supersonic jet travel all had "insane market data" which seemed at the time to show that there was a huge demand for these things.
My prediction is that we are in the early Concorde era of supersonic jet travel and Boeing is racing to catch up to the promise of this technology. Except that in an unregulated market such as the current tech market, we have forgone all the safety and security measures, and the Concorde has made its first passenger flight in 1969 (as opposed to 1976), with tons of fanfare and all flights fully booked months in advance.
Note that in the 1960s, market forecasts put demand for the Concorde at 350 airplanes built by 1980, and at the time the first prototypes were flying they had 74 options. Only 20 were ever built for passenger flight.
But! We here are not necessarily typical callers. How many IT calls from the general population can be served efficiently (for both parties) by a quality chatbot?
And lest we think I'm being elitist - let's take an area I am not proficient in - such as HR, where I am "general population".
Our internal corporate chatbot has gone from "atrocious insult to man and God" 7 years ago, to "far more efficient than a friendly but underpaid and inexperienced human being 3 countries away answering my incessant questions of what holidays do I have again, how many sick days do I have and how do I enter them, how do I process retirement, how do I enter my expenses, what's the difference between short and long term disability" etc etc. And it has a button for "start a complex HR case / engage a human being" for edge cases, so internally it works very well.
This is narrow anecdata about one notion of a service support chatbot; don't infer (hah) any further claims about the morality, economics or future of LLMs.
Woah there cowboy, slow down a little.
Demand for chips comes from the inference providers. Demand for inference was (and still is) being sold at below cost. OpenAI, for example, has a spend rate of $5b per month on revenues of $0.5b per month.
They are literally selling a dollar for actual 10c. Of course "demand" is going to be high.
This is definitely wrong, last year it was $725m/month expenses and $300m/month revenue. Looks like the nearly-2:1 ratio is also expected for this year: https://taptwicedigital.com/stats/openai
This also includes the cost of training new models, so I'm still not at all sure if inference is sold at-cost or not.
(To be clear, I'm not criticising the person I'm replying to.)
I tend to rough-estimate it based on known compute/electricity costs for open weights models etc., but what evidence I do have is loose enough that I'm willing to believe a factor of 2 per standard deviation of probability in either direction at the moment, so long as someone comes with receipts.
Subscription revenue and corresponding service provision are also a big question, because those will almost always be either under- or over-used, never precisely balanced.
It looks like you're using "expenses" to mean "opex". I said "spend rate", because they're spending that money (i.e. the sum of both opex and capex). The reason I include the capex is because their projections towards profitability, as stated by them many times, is based on getting the compute online. They don't claim any sort of profitability without that capex (and even with that capex, it's a little bit iffy)
This includes the Stargate project (they're committed for $10b - $20b (reports vary) before the end of 2025), and they've paid roughly $10b to Microsoft for compute for 2025. Oracle is (or already has) committed $40b in GPUs for Stargate, and Softbank has commitments to Stargate independently of OpenAI.
> Looks like the nearly-2:1 ratio is also expected for this year: https://taptwicedigital.com/stats/openai
I find it hard to trust these numbers[1]: The $40b funding was not in cash right now, and depends on Softbank for $30b with Softbank syndicating the remaining $10b. Softbank themselves don't have cash of $30b and has to get a loan to reach that amount. Softbank did provide $7.5b in cash, with milestones for the remainder. That was in May 2025. In August that money had run out and OpenAI did another raise of $8.3b.
In short, in the last two to three months, OpenAI spent $5b/month on revenues of $0.5b/m. They are also depending on Softbank coming through with the rest of the $40b before end of 2025 ($30b in cash and $10b by syndicating other investors into it) because their commitments require that extra cash.
Come Jan-2026, OpenAI would have received, and spent most of, $60b for 2025, with a projected revenue $12b-$13b.
---------------------------------
[1] Now, true, we are all going off rumours here (as this is not a public company, we don't have any visibility into the actual numbers), but some numbers match up with what public info there is and some don't.
I took their losses and added it to their revenue. That seems like that sum would equal expenses.
> The $40b funding was not in cash right now,
Does this matter? I'm not counting it as revenue.
> In short, in the last two to three months, OpenAI spent $5b/month on revenues of $0.5b/m.
You're repeating the same claim as before, I've not seen any evidence to support your numbers.
The evidence I linked you to suggests the 2025 average will be double that revenue, $1bn/month, at an expense of roughly $1.75bn/month ($12bn revenue plus a $9bn loss implies $21bn of expenses; $21bn / 12 months = $1.75bn/month)
> Does this matter? I'm not counting it as revenue.
Well, yes, because they forecast spending all of it by end of 2025, and they moved up their last round ($8.3b) by a month or two because they needed the money.
My point was, they received a cash injection of $10b (first part of the $40b raise) and that lasted only two months.
>> In short, in the last two to three months, OpenAI spent $5b/month on revenues of $0.5b/m.
> You're repeating the same claim as before, I've not seen any evidence to support your numbers.
Briefly, we don't really have visibility into their numbers. What we do have visibility into is how much cash they needed between two points (Specifically, the months of June and July). We also know what their spending commitment is (to their capex suppliers) for 2025. That's what I'm using.
They had $10b injected at the start of June. They needed $8.3b at the end of July.
Chatgpt, claude, gemini in chatbot or coding agent form? Great stuff, saves me some googling.
The same AI popping up in an e-mail, chat or spreadsheet tool? No thanks, normal people don't need an AI summary of a 200 word e-mail or slack thread. And if I've paid a guy a month's salary to write a report on something, of course I'll find 30 minutes to read it cover-to-cover.
The (in)ability to recognize a strange move’s brilliance might depend on the complexity of the game. The real world is much more complex than any board game.
Of course we can have AGI (damned if we don't): we've put so much in that it had better work.
But the problem is we can't do it right now because it's so expensive; AGI is not a matter of if but when.
But even then it's always about the cost.
They just need to "MCP" it into a robot body and it works (also part of the reason why OpenAI is buying a robotics company).
The complexity of achieving those might result in the "Centaur Era", when humans+computers are superior to either alone, lasting longer than the Centaur chess era, which spanned only 1-2 decades before engines like Stockfish made humans superfluous.
However, in well-defined domains, like medical diagnostics, it seems reasoning models alone are already superior to primary care physicians, according to at least 6 studies.
Ref: When Doctors With A.I. Are Outperformed by A.I. Alone by Dr. Eric Topol https://substack.com/@erictopol/p-156304196
Medical diagnosis relies heavily on knowledge, pattern recognition, a bunch of heuristics, educated guesses, luck, etc. These are all things LLMs do very well. They don't need a high degree of accuracy, because humans are already doing this work with a pretty low degree of accuracy. They just have to be a little more accurate.
AlphaZero also doesn't need training data as input--it's generated by game-play. The information fed in is just the game rules. Theoretically this should also be possible in research math. Less so in programming b/c we care about less rigid things like style. But if you rigorously defined the objective, training data should also not be necessary.
This is wrong; it wasn't just fed the rules, it was also fed a harness that tested viable moves and searched for optimal ones using a tree search method.
Without that harness it would not have gained superhuman performance. Such a harness is easy to make for Go but not as easy to make for more complex things. You will find that the harder it is to make an effective harness for a topic, the harder that topic is for AI models to solve: it is relatively easy to make a good harness for very well defined programming problems like competitive programming, but much, much harder for general purpose programming.
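To make the "harness" point concrete, here is a minimal sketch of what such a verifier can look like for a well-defined programming task; the `solve` entry-point name and the test format are assumptions for illustration, not anyone's actual training setup:

    def reward(candidate_source: str, test_cases) -> float:
        """Score a candidate solution by running it against known test cases."""
        namespace = {}
        try:
            exec(candidate_source, namespace)   # run the candidate's code
            solve = namespace["solve"]          # assumed entry-point name
            passed = sum(solve(*args) == expected for args, expected in test_cases)
            return passed / len(test_cases)     # dense, automatically checkable reward
        except Exception:
            return 0.0                          # crashing code scores zero

    # Toy usage: an "add two numbers" task.
    tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    print(reward("def solve(a, b): return a + b", tests))  # 1.0
    print(reward("def solve(a, b): return a - b", tests))  # ~0.33

Writing that kind of automatic check for "make a good change to this legacy codebase" is the part that doesn't generalise.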
Then that is not a general algorithm, and results from it don't apply to other problems.
If you mean symbolic reasoning, well it's pretty obvious that they aren't doing it since they fail basic arithmetic.
They can convincingly mimic human thought but the illusion falls flat at further inspection.
Calculators have been better than humans at arithmetic for well over half a century. Calculators can reason?
If that's your take-away from that paper, it seems you've arrived at the wrong conclusion. It's not that it's "fake", it's that it doesn't give the full picture, and if you only rely on CoT to catch "undesirable" behavior, you'll miss a lot. There is a lot more nuance than you allude to, from the paper itself:
> These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.
A more "proper" approach would be to work with sets of hypotheses and to conduct tests to exclude alternative explanations gradually - which medics call "DD" (differential diagnosis). Sadly, this is often not systematically done, and instead people jump on the first diagnosis and try if the intervention "fixes" things.
So I agree there are huge gains from "low hanging fruits" to be expected in the medical domain.
AI producing visual art has only flooded the internet with "slop", the commonly accepted term. It's something that meets the bare criteria, but falls short in producing anything actually enjoyable or worth anyone's time.
However, even artists need supporting materials and tooling that meet bare criteria. Some care what kind of wood their brush is made from, but I'd guess most do not.
I suspect it'll prove useless at the heart of almost every art form, but powerful at the periphery.
But IRL? Lots of measures exist, from money to votes to exam scores, and a big part of the problem is Goodhart's law — that the easy-to-define measures aren't sufficiently good at capturing what we care about, so we must not optimise too hard for those scores.
Winning or losing a Go game is a much shorter term objective than making or losing money at a job.
> But IRL? Lots of measures exist
No, not that are shorter term than winning or losing a Go game. A game of Go is very short, much much shorter than the time it takes for a human to get fired for incompetence.
I agree the time horizon of current SOTA models isn't particularly impressive. Doesn't matter in this point.
Are you simply referring to games having a defined win/loss reward function?
Because I'm pretty sure AlphaGo was groundbreaking also because it was self-taught, by playing itself; there were no training materials. Unless you say the rules of the game itself are the constraint.
But even then, from move to move, there are huge decisions to be made that are NOT easily defined with a win/loss reward function. Especially early game, there are many moves to make that don't obviously have an objective score to optimize against.
You could make the big leap and say that GO is so open ended, that it does model Life.
"artificial" maybe I should have said "synthetic"? I mean the computer can teach itself.
"constrained" the game has rules that can be evaluated
and as to the other -- I don't know what to tell you, I don't think anything I said is inconsistent with the below quotes.
It's clearly not just a generic LLM, and it's only possible to generate a billion training examples for it to play against itself because synthetic data is valid. And synthetic data contains training examples no human has ever done, which is why it's not at all surprising it did stuff humans never would try. A LLM would just try patterns that, at best, are published in human-generated go game histories or synthesized from them. I think this inherently limits the amount of exploration it can do of the game space, and similarly would be much less likely to generate novel moves.
https://en.wikipedia.org/wiki/AlphaGo
> As of 2016, AlphaGo's algorithm uses a combination of machine learning and tree search techniques, combined with extensive training, both from human and computer play. It uses Monte Carlo tree search, guided by a "value network" and a "policy network", both implemented using deep neural network technology.[5][4] A limited amount of game-specific feature detection pre-processing (for example, to highlight whether a move matches a nakade pattern) is applied to the input before it is sent to the neural networks.[4] The networks are convolutional neural networks with 12 layers, trained by reinforcement learning.[4]
> The system's neural networks were initially bootstrapped from human gameplay expertise. AlphaGo was initially trained to mimic human play by attempting to match the moves of expert players from recorded historical games, using a database of around 30 million moves.[21] Once it had reached a certain degree of proficiency, it was trained further by being set to play large numbers of games against other instances of itself, using reinforcement learning to improve its play.[5] To avoid "disrespectfully" wasting its opponent's time, the program is specifically programmed to resign if its assessment of win probability falls beneath a certain threshold; for the match against Lee, the resignation threshold was set to 20%.[64]
I was misremembering the order of how things happened.
AlphaZero, another iteration after the famous matches, was trained without human data.
"AlphaGo's team published an article in the journal Nature on 19 October 2017, introducing AlphaGo Zero, a version without human data and stronger than any previous human-champion-defeating version.[52] By playing games against itself, AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days.[53]"
Unless it's a MUCH bigger play where through some butterfly effect it wants me to fail at something so I can succeed at something else.
My real name is John Connor by the way ;)
They are good at framing what is going on and going over general plans and walking through some calculations and potential tactics. But I wouldn't say even really strong players like Leko, Polgar, Anand will have greater insights in a Magnus-Fabi game without the engine.
An average driver evaluating both would have a very hard time finding the F1's superior utility.
One can intentionally use a recent and a much older model to figure out whether the tests are reliable, and in which domains they are reliable.
One can compute a model's joint probability for a sequence and compare how likely each model finds the same sequence (a rough sketch of this follows below).
We could ask both to start talking about a subject, but alternatingly each can emit a token. Then look at how the dumber and smarter models judge the resulting sentence: does the smart one tend to pull up the quality of the resulting text, or does it tend to get dragged down towards the dumber participant?
Given enough such tests to "identify the dummy vs the smart one", and verifying them against cases of common agreement (as an extreme, word2vec vs a transformer), we can assess the quality of the tests regardless of domain.
On the assumption that such or similar tests allow us to indicate the smarter one, i.e. assuming we find plenty of such tests, we can demand model makers publish open weights so that we can publicly verify performance agreements.
Another idea is self-consistency tests: a single forward inference over a context of, say, 2048 tokens (just an example) effectively predicts the conditional 2-gram, 3-gram, 4-gram... probabilities on the input tokens. Each output token distribution is predicted from the preceding inputs, so there are 2048 input tokens and 2048 output distributions: the position-1 output is the predicted token (logit vector, really) estimated to follow the position-1 input, the position-2 output is the prediction following the first 2 inputs, and so on, with the last vector being the predicted next token following all 2048 input tokens: p(t_(i+1) | t_1 = a, t_2 = b, ..., t_i = z).
But that is just one way the next token can be predicted using the network. Another approach would be to use RMAD gradient descent, keeping the model weights fixed and treating only the last, say, 512 input vectors as variable: how well do the last 512 forward-prediction output vectors match the gradient-descent best joint-probability output vectors?
This could be added as a loss term during training as well, as a form of regularization, which turns it into a kind of Energy Based Model roughly.
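Here is the rough sketch of the joint-probability comparison mentioned above, assuming two hypothetical causal LM checkpoints as stand-ins for a "dumber" and a "smarter" model (the model names below are placeholders):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    def avg_logprob_per_token(model_name: str, text: str) -> float:
        """Average log-probability per token the model assigns to `text`."""
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels == inputs, the returned loss is the mean cross-entropy
            # over predicted positions, i.e. -avg log p(token | prefix).
            loss = model(ids, labels=ids).loss
        return -loss.item()

    text = "An example sentence whose likelihood we compare across models."
    for name in ["gpt2", "gpt2-large"]:  # placeholder checkpoints
        print(name, avg_logprob_per_token(name, text))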
Learning from experience (hopefully not always your own), working well with others, and being able to persevere when things are tough, demotivational or boring, trumps raw intelligence easily, IMO.
Why one would continue to know or talk about the number is a pretty strong indicator of the previous statement.
Because of perceived illegal biases, these evaluations are no longer used in most cases, so we tend to use undergraduate education as a proxy. Places that are exempt from these considerations continue to make successful use of it.
Neither an IQ test nor your grades as an undergraduate correlate to performance in some other setting at some other time. Life is a crapshoot. Plenty of people in Mensa are struggling and so are those that were at the top of class.
https://www.insidehighered.com/news/student-success/life-aft...
Actual study:
I don’t know my IQ, but I probably would score above average and have undiagnosed ADHD. I scored in the 95th percentile + on most standardized tests in school but tended to have meh grades. I’m great at what I do, but I would be an awful pilot or surgeon.
Growing up, you know a bunch of people. Some are dumb, some are brilliant, some disciplined, some impetuous.
Think back, and more of the smart ones tend to align with professions that require more brainpower. But you probably also know people who weren’t brilliant at math or academics, but they had focus and did really well.
How so? Solving more progressive matrices?
...on IQ tests.
If some people in the test population got 0s because the test was in English and they didn't speak English, and then everyone else got random results, it'd still correlate with job performance if the job required you to speak English. Wouldn't mean much though.
https://medium.com/incerto/iq-is-largely-a-pseudoscientific-...
What actual fact are you trying to state, here?
If you run anything sufficiently complex through a principal component analysis you'll get several orthogonal factors, decreasing in importance. The question then is whether the first factor dominates or not.
My understanding is that it does, with "g" explaining some 50% of the variance, and the various smaller "s" factors maybe 5% to 20% at most.
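A toy sketch of that point: synthetic "test score" data with one strong shared factor, run through PCA to see how much variance the first component (the "g"-like factor) explains; the numbers are made up for illustration only:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_people, n_tests = 1000, 8
    g = rng.normal(size=(n_people, 1))                             # shared ability factor
    scores = 0.7 * g + 0.5 * rng.normal(size=(n_people, n_tests))  # plus test-specific noise

    pca = PCA().fit(scores)
    print(pca.explained_variance_ratio_)  # first component dominates; the rest are small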
You're right, but the things you could do with it if you applied yourself are totally out of reach for me; for example, it's quite possible for you to become an AI researcher in one of the leading companies and make millions. I just don't have that kind of intellectual capacity. You could make it into med school and also make millions. I'm not saying all this matters that much, with all due respect to financial success, but I don't think we can pretend our society doesn't reward high IQs.
That high IQ needs to be paired with hard work.
Went to the equivalent of a mensa meeting group a couple of times. The people there were much smarter than me, but they all had their problems and many of them weren't that successful at all despite their obvious intelligence.
Searching around, IQs for doctors seem to average about 120, with the middle 80% of the range being roughly 105-130. So there are plenty of doctors with IQs of 105, which is not that far above average.
That also means that it's prudent to be selective in your doctors if you have any serious medical issues.
1: https://www.cambridge.org/core/journals/cambridge-quarterly-...
Where are you getting this from, exactly? Getting into medical school is very difficult in the U.S. Having an average IQ of 105 would make it borderline impossible; even if you cram for the SAT and other tests twice as much as everyone else, there is only so much you can do, because these tests test for speed and raw brain power. In my country, the SAT-equivalent score you need to get in would put you above the top 2%; it's more like the top 1.5% to 1%, because the population keeps growing while the number of working doctors stays roughly constant. So really, each high school had only 2-3 kids per class who would get in. I know a few of these people - really brilliant kids whose IQs were probably above 130 - and it was impossible for me to compete with them in getting in. I am simply not exceptional, at least not that far up the distribution. I was maybe in the top 3-5 students in my class but never the best, so let's say top 10%; these kids were the best students in the whole school - that's top 1%-2%.
One caveat to all this is that sure, in some countries it is easier to get in. People from my country (usually from families who can afford it) go to places like Romania, Czechoslovakia, Italy etc where it is much much easier to get in to med school (but costs quite a lot and also means you have to leave your home country for 7 years).
Now is it necessary to have an IQ off the charts to be a good doctor - no, probably not, but that's not what I was arguing, that's just how admission works.
I agree it'd be almost impossible, but apparently not impossible with an IQ of 105. Could be folks with ADHD whose composite IQ is brought down by a smaller working memory but whose long term associative memory is top notch. Could be older doctors from when admissions were easier. Could be plain old nepotism.
After all, the AMA keeps admissions artificially low in the US to increase salary and prestige. It's a big part of the reason medical costs are so high in the US, in my opinion.
Reference I found here:
https://forum.facmedicine.com/threads/medical-doctors-ranked...
> Hauser, Robert M. 2002. "Meritocracy, cognitive ability, and the sources of occupational success." CDE Working Paper 98-07 (rev)
After all you just might seem like an insufferable smartass to someone you probably want to be liked by. Why hurt interpersonal relationships for little gain?
If your colleague is really that bright, I wouldn't be surprised if they're simply careful about how much and when they show it to us common folk.
I don't think they are faking it.
Even if they've saturated the distinguishable quality for tasks they can both do, I'd expect a gap in what tasks they're able to do.
But one thing will stay consistent with LLMs for some time to come: they are programmed to produce output that looks acceptable, but they all unintentionally tend toward deception. You can iterate on that over and over, but there will always be some point where it will fail, and the weight of that failure will only increase as it deceives better.
Some things that seemed safe enough: Hindenburg, Titanic, Deepwater Horizon, Chernobyl, Challenger, Fukushima, Boeing 737 MAX.
So what analogy with AI are you trying to make? The straightforward one would be that there will be some toxic and dangerous LLMs (cough Grok cough), but that there will be many others that do their jobs as designed, and that LLMs in general will be a common technology going forward.
Titanic - people have been boating for two thousand years, and it was run into an iceberg in a place where icebergs were known to be, killing >1500 people.
Hindenburg was an aircraft design of the 1920s, very early in flying history, was one of the most famous air disasters and biggest fireballs and still most people survived(!), killing 36. Decades later people were still suggesting sabotage was the cause. It’s not a fair comparison, an early aircraft against a late boat.
Its predecessor, the Graf Zeppelin[1], was one of the best flying vehicles of its era by safety and miles traveled; look at its achievements compared to aeroplanes of that time period. Nothing else at the time could do that - was any other aircraft that safe?
If airships had the eighty more years that aeroplanes have put into safety, my guess is that a gondola with hydrogen lift bags dozens of meters above it could be - would be - as safe as a jumbo jet with 60,000 gallons of jet fuel in the wings. Hindenburg killed 36 people 80 years ago, aeroplane crashes have killed 500+ people as recently as 2014.
Wasn’t Challenger known to be unsafe? (Feynman inquiry?). And the 737 MAX was Boeing skirting safety regulations to save money.
Glad you mention it. Connecting back to AI: there are many possible future scenarios with negative outcomes involving human sabotage of AI -- or the use of AI to sabotage other systems.
The AI companies have convinced the US government that there should be no AI safety regulations: https://www.wired.com/story/plaintext-sam-altman-ai-regulati...
> everyone knows you need to carefully review vibe coded output. This [safety-critical company] hiring zero developers isn't representative of software development as a profession.
> They also used old 32b models for cost reasons so it doesn't knock against AI-assisted development either.
Look at the state of the world today, AirBus have a Hydrogen powered commercial aircraft[1]. Toyota have Hydrogen powered cars on the streets. People upload safety videos to YouTube of Hydrogen cars turning into four-meter flamethrowers as if that's reassuring[3]. There are many[2] Hydrogen refuelling gas stations in cities in California where ordinary people can plug high pressure Hydrogen hoses into the side of their car and refuel it from a high pressure Hydrogen tank on a street corner. That's not going to be safer when it's a 15 year old car, a spaced-out owner, and a skeezy gas station which has been looking the other way on maintenance for a decade, where people regularly hear gunshots and do burnouts and crash into things. Analysts are talking about the "Hydrogen Economy" and a tripling of demand for Green Hydrogen in the next two decades. But lifting something with Hydrogen? Something the Graf Zeppelin LZ-127 demonstrated could be done safely with 1920s technology? No! That's too dangerous!
Number of cars on US roads when the Hindenburg burnt? Around 25 million. Now? 285 million, killing 40,000 people every year. A Hindenburg death toll two or three times a day, every day, on average. A 9/11 every couple of months. Nobody is as concerned as they are about airships because there isn't a massive fireball and a reporter saying "oh the humanity". 36 people died 80 years ago in an early air vehicle and it's stop everything, this cannot be allowed to continue! The comparisons are daft in so many ways. Say airships are too slow to be profitable, say they're too big and difficult to manoeuvre against the wind. But don't say they were believed to be perfectly safe and turned out to be too dangerous, and put that forward as a considered, reasonable position to hold.
Some of the sabotage accusations suggested it was a gunshot, but you know why that's not so plausible? Because you can fire machine guns into Hydrogen blimps and they don't blow up! "LZ-39, though hit several times [by fighter aeroplane gunfire], proceeded to her base despite one or more leaking cells, a few killed in the crew, and a propeller shot off. She was repaired in less than a week. Although damaged, her hydrogen was not set on fire and the “airtight subdivision” provided by the gas cells insured her flotation for the required period. The same was true of the machine gun. Until an explosive ammunition was put into service no airplane attacks on airships with gunfire had been successful."[4]. How many people who say Hydrogen airships are too dangerous realise they can even take machine gun fire into their gas bags and not burn and keep flying?
[1] https://www.airbus.com/en/innovation/energy-transition/hydro...
[2] https://afdc.energy.gov/fuels/hydrogen-locations#/find/neare...
[3] https://www.youtube.com/watch?v=OA8dNFiVaF0
[4] https://www.usni.org/magazines/proceedings/1936/september/vu...
Yes, because I'd get them to play each other?
I don't need to understand how the AI made the app I asked for or cured my cancer, but it'll be pretty obvious when the app seems to work and the cancer seems to be gone.
I mean, I want to understand how, but I don't need to understand how, in order to benefit from it. Obviously understanding the details would help me evaluate the quality of the solution, but that's an afterthought.
However, I do believe that once the genuine AGI threshold is reached it may cause a change in that rate. My justification is that while current models have gone from a slightly good copywriter in GPT-4 to very good copywriter in GPT-5, they've gone from sub-exceptional in ML research to sub-exceptional in ML research.
The frontier in AI is driven by the top 0.1% of AI researchers. Since improvement in these models is driven partially by the very peaks of intelligence, it won't be until models reach that level where we start to see a new paradigm. Until then it's just scale and throwing whatever works at the GPU and seeing what comes out smarter.
Presently we are still a long way from that. In my opinion, we are at least as far away from AGI as 1970s mainframes were from LLMs.
I really don’t expect to see AGI in my lifetime.
And with AGI, you also have the likes of Sam Altman making up bullshit claims just to pump up the investment into OpenAI. So I wouldn’t take much of their claims seriously either.
LLMs are a fantastic invention. But they’re far closer to SMS text predict than they are to generalised intelligence.
Though what you might see is OpenAI et al redefine the term “AGI” just so they can say they’ve hit that milestone, again purely for their own financial gain.
I'm cautiously optimistic of each technology, but the point is it's easy to find bullshit predictions without actually gaining any insight into what will happen with a given technology.
You need both the generalised part of AGI and the ability to self learn. One without the other wouldn’t cause a singularity.
(Artificial General Intelligence says nothing about self-learning though. I presume you mean ASI?)
- the need to sleep for 1/3 of our life
- the need to eat, causing more pauses in work
- much slower (like several orders of magnitude slower) data input capabilities
- lossy storage (aka forgetfulness)
- emotions
- other primal urges, like the need to procreate
These big models don't dynamically update as days pass by - they don't learn. A personal assistant service may be able to mimic learning by creating a database of your data or preferences, but your usage isn't baked back into the big underlying model permanently.
I don't agree with "in our lifetimes", but the difference between training and learning is the bright red line. Until there's a model which is able to continually update itself, it's not AGI.
My guess is that this will require both more powerful hardware and a few more software innovations. But it'll happen.
I think we should be treating AGI like Cold Fusion, phrenology, or even alchemy. It is not science, but science fiction. It is not going to happen and no research into AGI will provide anything of value (except for the grifters pushing the pseudo-science).
You can only experience the world in one place in real time. Even if you networked a bunch of "experiencers" together to gather real time data from many places at the same time, you would need a way to learn and train on that data in real time that could incorporate all the simultaneous inputs. I don't see that capability happening anytime soon.
Point is, I think self-learning at any speed is huge and as soon as it's achieved, it'll explode quadratically even if the first few years are slow.
I think this type of thinking is a critical part of human creativity, and I can't see the current incarnation of agentic coding tools get there. They currently are way too reliant on a human carefully crafting the context and being careful of not putting in too many contradictory instructions or overloading the model with irrelevant details. An AGI has to be able to work productively on its own for days or weeks without going off on a tangent or suffering Xerox-like amnesia because it has compacted its context window 100 times.
The real irony is that from now on, because people use this magic, it will stay forever. What you can count on, in my opinion, is that this whole world changes: you won't need to write software anymore because everything is AI. Hard to imagine, and too far in the future to be relevant for speculation.
I think user experience and pricing models are the biggest opportunities here. Right now everyone's just passing down costs as they come, with no real loss leaders except a free tier. I looked at reviews of some of the various wrappers on app stores; people say "I hate that I have to pay for each generation and not know what I'm going to get" - the market would like a service priced very differently. Is it economical? Many will fail, one will succeed. People will copy the model of that one.
SGI would be self-improving along some function with a shape close to linear in the amount of time and resources. That's almost exclusively dependent on the software design, as transformers have currently shown themselves to hit a wall of logarithmic progress per unit of resources.
In other words, no, it has little to do with the commercial race.
I have a had a bunch of positive experiences as well, but when it goes bad, it goes so horribly bad and off the rails.
The real take-off / winner-take-all potential is in retrieval and knowing how to provide the best possible data to the LLM. That strategy will work regardless of the model.
I don't think this has anything to do with AGI. We aren't at AGI yet. We may be close or we may be a very long way away from AGI. Either way, current models are at a plateau and all the big players have more or less caught up with each other.
As is, AI is quite intelligent, in that it can process large quantities of diverse unstructured information and build meaningful insights. And that intelligence applies across an incredibly broad set of problems and contexts - enough that I have a hard time not calling it general. Sure, it has major flaws that are obvious to us, and it's much worse at many things we care about. But that doesn't make it not intelligent or general. If we want to set human intelligence as the baseline, we already have a word for that: superintelligence.
Superintelligence implies it's above human level, not at human level. General intelligence implies it can do what humans can do in general, not just replace a few of the things humans can do.
The AIs improve by gradient descent, still the same as ever. It's all basic math and a little calculus, and then making tiny tweaks to improve the model over and over and over.
There's not a lot of room for intelligence to improve upon this. Nobody sits down and thinks really hard, and the result of their intelligent thinking is a better model; no, the models improve because a computer continues doing basic loops over and over and over trillions of times.
That's my impression anyway. Would love to hear contrary views. In what ways can an AI actually improve itself?
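For concreteness, the loop being described is essentially this (a toy sketch of my own, fitting y = 2x + 1 rather than a language model; real training is the same idea scaled up to billions of parameters and trillions of tokens):

    # Plain gradient descent: the "tiny tweaks, repeated endlessly" loop.
    import torch

    x = torch.linspace(-1, 1, 100).unsqueeze(1)
    y = 2 * x + 1

    w = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    lr = 0.1

    for step in range(1000):
        loss = ((x * w + b - y) ** 2).mean()  # how wrong the model currently is
        loss.backward()                       # a little calculus: the gradient
        with torch.no_grad():
            w -= lr * w.grad                  # a tiny tweak in the right direction
            b -= lr * b.grad
            w.grad.zero_()
            b.grad.zero_()

    print(w.item(), b.item())  # converges toward 2.0 and 1.0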
But I doubt we will ever see a fully autonomous, reliable AGI system.
This misunderstanding is nothing more than the classic "logistic curves look like exponential curves at the beginning". All (Transformer-based, feedforward) AI development efforts are plateauing rapidly.
AI engineers know this plateau is there, but of course every AI business has a vested interest in overpromising in order to access more funding from naive investors.
That took the world from autocomplete to Claude and GPT.
Another 10,000x would do it again, but who has that kind of money or R&D breakthrough?
The way scaling laws work, 5,000x and 10,000x give a pretty similar result. So why is it surprising that competitors land in the same range? It seems hard enough to beat your competitor by 2x let alone 10,000x
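To make that concrete with a toy power law (the constants below are invented for illustration, not fitted to any real model):

    # Hypothetical Chinchilla-style scaling curve: loss = a * C**(-b) + floor.
    # The constants are made up; the point is only that the curve flattens,
    # so 5,000x and 10,000x compute land close together.
    a, b, floor = 10.0, 0.05, 1.7

    def loss(compute_multiplier: float) -> float:
        return a * compute_multiplier ** (-b) + floor

    for mult in (1, 5_000, 10_000):
        print(f"{mult:>6,}x compute -> loss {loss(mult):.2f}")
    # 1x      -> 11.70
    # 5,000x  ->  8.23
    # 10,000x ->  8.01   (the last 2x buys very little)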
Yes. And the fact they're instead clustering simply indicates that they're nowhere near AGI and are hitting diminishing returns, as they've been doing for a long time already. This should be obvious to everyone. I'm fairly sure that none of these companies has been able to use their models as a force multiplier in state-of-the-art AI research. At least not beyond a 1+ε factor. Fuck, they're just barely a force multiplier in mundane coding tasks.
That seems hardly surprising, considering the condition to receive the benefit has not been met.
The person who lights a campfire first will become warmer than the rest, but while they are trying to light the fire the others are gathering firewood. So while nobody has a fire, those lagging are getting closer to having a fire.
What do you think AGI is?
How do we go from sentence composing chat bots to General Intelligence?
Is it even logical to talk about such a thing as abstract general intelligence when every form of intelligence we see in the real world is applied to specific goals as evolved behavioral technology refined through evolution?
When LLMs start undergoing spontaneous evolution then maybe it is nearer. But now they can't. Also there is so much more to intelligence than language. In fact many animals are shockingly intelligent but they can't regurgitate web scrapings.
Since then they've been about neck and neck with some models making different tradeoffs.
Nobody needs to reach AGI to take off. They just need to bankrupt their competitors since they're all spending so much money.
There were two interesting takeaways about AGI:
1. Dario makes the remark that the term AGI/ASI is very misleading and dangerous. These terms are ill defined and it's more useful to understand that the capabilities are simply growing exponentially at the moment. If you extrapolate that, he thinks it may just "eat the majority of the economy". I don't know if this is self-serving hype, and it's not clear where we will end up with all this, but it will be disruptive, no matter what.
2. The Economist moderators, however, note towards the end that this industry may well tend toward commoditization. At the moment these companies produce models that people want but others can't make. But as chip making starts to hit its limits and the information space becomes completely harvested, capability growth might taper off and others will catch up, the quasi-monopoly profit potential melting away.
Putting that together, I think that although the cognitive capabilities will most likely continue to accelerate, albeit not necessarily along the lines of AGI, the economics of all this will probably not lead to a winner takes all.
[1] https://www.economist.com/podcasts/2025/07/31/artificial-int...
2. Commoditization can be averted with access to proprietary data. This is why all of ChatGPT, Claude, and Gemini push for agents and permissions to access your private data sources now. They will not need to train on your data directly. Just adapting the models to work better with real-world, proprietary data will yield a powerful advantage over time.
Also, the current training paradigm utilizes RL much more extensively than in previous years and can help models to specialize in chosen domains.
About 2: Ah, yes. So if one vendor gains sufficient momentum, their advantage may accelerate, which will be very hard to catch up with.
I also feel like it's stopped being exponential already. I mean, in the last few releases we've only seen marginal improvements. Even this release feels marginal; I'd say it feels more like a linear improvement.
That said, we could see a winner take all due to the high cost of copying. I do think we're already approaching something where it's mostly price and who released their models last. But the cost to train is huge, and at some point it won't make sense and maybe we'll be left with 2 big players.
It's not obvious if a similar breakthrough could occur in AI
This seems to be a result of using overly simplistic models of progress. A company makes a breakthrough, the next breakthrough requires exploring many more paths. It is much easier to catch up than find a breakthrough. Even if you get lucky and find the next breakthrough before everyone catches up, they will probably catch up before you find the breakthrough after that. You only have someone run away if each time you make a breakthrough, it is easier to make the next breakthrough than to catch up.
Consider the following game:
1. N parties take turns rolling a D20. If anyone rolls 20, they get 1 point.
2. If any party is 1 or more points behind, they only need to roll a 19 or higher to get one point. That is, being behind gives you a slight advantage in catching up.
As points accumulate, most of the players end up with essentially the same score.
I ran a simulation of this game for 10,000 turns with 5 players:
Game 1: [852, 851, 851, 851, 851]
Game 2: [827, 825, 827, 826, 826]
Game 3: [827, 822, 827, 827, 826]
Game 4: [864, 863, 860, 863, 863]
Game 5: [831, 828, 836, 833, 834]
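For anyone who wants to reproduce it, here is a quick sketch of the game as described above (my own re-implementation from those rules; exact scores will vary with the random seed):

    # Catch-up game: D20 rolls, 5 players, 10,000 turns; a player who is
    # behind scores on 19+ instead of only on a natural 20.
    import random

    def play(n_players=5, turns=10_000, seed=None):
        rng = random.Random(seed)
        scores = [0] * n_players
        for _ in range(turns):
            for i in range(n_players):
                behind = scores[i] < max(scores)
                threshold = 19 if behind else 20
                if rng.randint(1, 20) >= threshold:
                    scores[i] += 1
        return scores

    for game in range(1, 6):
        print(f"Game {game}: {play(seed=game)}")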
But yes, so far it feels like we are in the latter stages of the innovation S-curve for transformer-based architectures. The exponent may be out there but it probably requires jumping onto a new S-curve.
I think it does let you start explore the paths faster, but the search space you need to cover grows even faster. You can do research two times faster but you need to do ten times as much research and your competition can quickly catch up because they know what path works.
It is like drafting in a bike race.
Barring a kind of grey swan event of groundbreaking algorithmic innovation, I don't see how we get out of this. I suppose it could be that some of those diminishing returns are still big enough to bridge the gap to create an AI that can meaningfully recursively improve itself, but I personally don't see it.
At the moment, I would say everything is progressing exactly as expected and will continue to do so until it doesn't. If or when that happens is not predictable.
You are completely right that the compute and funding right now are unprecedented. I don't feel confident making any predictions.
Consider the research work required for five breakthroughs in series: 1, 2, 16, 8, 128 units, where each breakthrough doubles your research power.
If you start at 1 unit of research per year, you get the first breakthrough after 1/1 = 1 year. Then you get the second breakthrough after 2/2 = 1 year, the third after 16/4 = 4 years, the fourth after 8/8 = 1 year, and the fifth after 128/16 = 8 years.
If it only takes one year for a competitor to learn your breakthrough, they can catch up despite the fact that your research rate is doubling after every breakthrough.
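Spelling out that arithmetic (same numbers as above):

    # Breakthroughs cost 1, 2, 16, 8, 128 units of research; research power
    # starts at 1 and doubles after each breakthrough.
    costs = [1, 2, 16, 8, 128]
    power, elapsed = 1, 0.0
    for i, cost in enumerate(costs, start=1):
        years = cost / power
        elapsed += years
        print(f"breakthrough {i}: {cost}/{power} = {years:g} years (total {elapsed:g})")
        power *= 2
    # With a one-year copying lag, a competitor is never far behind despite
    # the leader's doubling research power.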
This assumes an infinite potential for improvement though. It's also possible that the winner maxes out after threshold day plus one week, and then everyone hits the same limit within a relatively short time.
That's only one part of it. Some forecasters put probabilities on each of the four quadrants in the takeoff speed (fast or slow) vs. power distribution (unipolar or multipolar) table.
To be honest that is what you would want if you were digitally transforming the planet with AI.
You would want to start with a core so that all models share similar values in order they don't bicker etc, for negotiations, trade deals, logistics.
Would also save a lot of power so you don't have to train the models again and again, which would be quite laborious and expensive.
Rather each lab would take the current best and perform some tweak or add some magic sauce then feed it back into the master batch assuming it passed muster.
Share the work, globally for a shared global future.
At least that is what I would do.
Current models, when they apply reasoning, have feedback loops: they use tools for trial and error, they have a short-term memory (the context) - or multiple short-term memories if you use agents - and a long-term memory (markdown files, RAG). With that, they can solve problems that aren't hardcoded in their brain/model, and they can store these solutions in their long-term memory for later use, or for sharing with other LLM-based systems.
AGI needs to come from a system that combines LLMs + tools + memory. And I've had situations where it felt like I was working with an AGI. The LLMs seem advanced enough to serve as the kernel of an AGI system.
The real challenge is how you give these AGIs a mission/goal that they can pursue fairly independently, without constant hand-holding. How does it know that it's doing the right thing? The focus currently is on writing better specifications, but humans aren't very good at writing specs for things that are uncertain. We also learn from trial and error, and that too influences specs.
Nothing we have is anywhere near AGI and as models age others can copy them.
I personally think we are nearing the end of improvement for LLMs with current methods. We have consumed all of the readily available data already, so there is no more good-quality training material left. We either need new, novel approaches or have to hope that if enough compute is thrown at training, actual intelligence will spontaneously emerge.
If I were working in a job right now where I could see and guide and retrain these models daily, and realized I had a weapon of mass destruction on my hands that could War Games the Pentagon, I'd probably walk my discoveries back too. Knowing that an unbounded number of parallel discoveries were taking place.
It won't take AGI to take down our fragile democratic civilization premised on an informed electorate making decisions in their own interests. A flood of regurgitated LLM garbage is sufficient for that. But a scorched earth attack by AGI? Whoever has that horse in their stable will absolutely keep it locked up until the moment it's released.
Both the AGI threshold with the LLM architecture and the idea of self-advancing AI are pie in the sky, at least for now. These are myths of the rationalist cult.
We'd more likely see reduced returns and smaller jumps between version updates, plus regression from all the LLM produced slop that will be part of the future data.
Even if we run with the assumption that LLMs can become human-level AI researchers, and are able to devise and run experiments to improve themselves, even then the runaway singularity assumption might not hold. Let's say Company A has this LLM, while company B does not.
- The automated AI researcher, like its human peers, still needs to test the ideas and run experiments, it might happen that testing (meaning compute) is the bottleneck, not the ideas, so Company A has no real advantage.
- It might also happen that AI training has some fundamental compute limit coming from information theory, analogous to the Shannon limit, and once again, more efficient compute can only approach this, not overcome it
Why is this even an axiom, that this has to happen and it's just a matter of time?
I don't see any credible argument for the path LLM -> AGI, in fact given the slowdown in enhancement rate over the past 3 years of LLMs, despite the unprecedented firehose of trillions of dollars being sunk into them, I think it points to the contrary!
But nowadays, how can corporations "justify" spending gigantic amounts of R&D resources (time + hardware + energy) on models which are not LLMs?
But I think we're not even on the path to creating AGI. We're creating software that replicate and remix human knowledge at a fixed point in time. And so it's a fixed target that you can't really exceed, which would itself already entail diminishing returns. Pair this with the fact that it's based on neural networks which also invariably reach a point of sharply diminishing returns in essentially every field they're used in, and you have something that looks much closer to what we're doing right now - where all competitors will eventually converge on something largely indistinguishable from each other, in terms of ability.
Why would you presume this? I think part of a lot of people's AI skepticism is talk like this. You have no idea. Full stop. Why wouldn't progress be linear? As new breakthroughs come, newer ones will be harder to come by. Perhaps it's exponential. Perhaps it's linear. No one knows.
Sure, you can scale it, but if an LLM takes, say, $1 million a year to run an AGI instance, but it costs only $500k for one human researcher, then it still doesn’t get you anywhere faster than humans do.
It might scale up, it might not, we don’t know. We won’t know until we reach it.
We also don't know if it scales linearly, or if its learning capability and capacity will be able to support exponential capability increases. Our current LLMs don't even have the capability of self-improvement or learning even if they were otherwise capable: they can accumulate additional knowledge through the context window, but the models are static unless you fine-tune or retrain them. What if our current models were ready for AGI but these limitations are stopping it? How would we ever know? Maybe it will be able to self-improve, but it will take exponentially larger amounts of training data. Or exponentially larger amounts of energy. Or maybe it can become "smarter" but at the cost of being larger, to the point where the laws of physics mean it has to think slower - 2x the thinking but 2x the time, could happen! What if an AGI doesn't want to improve?
Far too many unknowns to say what will happen.
Just from the fact that the LLM can/will work on the issue 24/7 vs a human who typically will want to do things like sleep, eat, and spend time not working, there would already be a noticeable increase in research speed.
Imagine a field where experiments take days to complete, and reviewing the results and doing deep thought work to figure out the next experiment takes maybe an hour or two for an expert.
An LLM would not be able to do 24/7 work in this case, and would only save a few hours per day at most. Scaling up to many experiments in parallel may not always be possible, if you don't know what to do with additional experiments until you finish the previous one, or if experiments incur significant cost.
So an AGI/expert LLM may be a huge boon for e.g. drug discovery, which already makes heavy use of massively parallel experiments and simulations, but may not be so useful for biological research (perfect simulation down to the genetic level of even a fruit fly likely costs more compute than the human race can provide presently), or research that involves time-consuming physical processes to complete, like climate science or astronomy, that both need to wait periodically to gather data from satellites and telescopes.
With automation, one AI can presumably do a whole lab's worth of parallel lab experiments. Not to mention, they'd be more adept at creating simulations that obviates the need for some types of experiments, or at least, reduces the likelihood of dead end experiments.
There are too many unknowns to make any assertions about what will or won’t happen.
Are you sure? I previously accepted that as true, but, without being able to put my finger on exactly why, I am no longer confident in that.
What are you supposed to do if you are a manically depressed robot? No, don't try to answer that. I'm fifty thousand times more intelligent than you, and even I don't know the answer. It gives me a headache just trying to think down to your level. -- Marvin to Arthur Dent
(...as an anecdote, not the impetus for my change in view.)
Driving from A to B takes 5 hours; if we get five drivers, will we arrive in one hour or five? In research there are many steps like this (in the sense that the time is fixed and independent of the number of researchers, or even of how much better one researcher is than another), and adding in something that does not sleep or eat isn't going to make the process more efficient.
I remember when I was an intern and my job was to incubate eggs and then inject the chicken embryo with a nanoparticle solution to then look under a microscope. In any case incubating the eggs and injecting the solution wasn't limited by my need to sleep. Additionally our biggest bottleneck was the FDA to get this process approved, not the fact that our interns required sleep to function.
My point here was simply that there is an economic factor that trivially could make AGI less viable over humans. Maybe my example numbers were off, but my point stands.
So I don't think it's a given that progress will just be "exponential" once we have an AGI that can teach itself things. There is a vast ocean of original thought that goes beyond simple self-optimization.
Fundamentally discovery could be described as looking for gaps in our observation and then attempting to fill in those gaps with more observation and analysis.
The age of low hanging fruit shower thought inventions draws to a close when every field requires 10-20+ years of study to approach a reasonable knowledge of it.
"Sparks" of creativity, as you say, are just based upon memories and experience. This isn't something special, its an emergent property of retaining knowledge and having thought. There is no reason to think AI is incapable of hypothesizing and then following up on those.
Every AI can be immediately imparted with all expert human knowledge across all fields. Their threshold for creativity is far beyond ours, once tamed.
That's nothing close to AGI though. An AI of some kind may be able to design and test new algorithms because those algorithms live entirely in the digital world, but that skill isn't generalized to anything outside of the digital space.
Research is entirely theoretical until it can be tested in the real world. For an AGI to do that it doesn't just need a certain level of intelligence, it needs a model of the world and a way to test potential solutions to problems in the real world.
Claims that AGI will "solve" energy, cancer, global warming, etc all run into this problem. An AI may invent a long list of possible interventions but those interventions are only as good as the AI's model of the world we live in. Those interventions still need to be tested by us in the real world, the AI is really just guessing at what might work and has no idea what may be missing or wrong in its model of the physical world.
Those observations only lead to scaling research linearly, not exponentially.
Assuming a given discovery requires X units of effort, simply adding more time and more capacity just means we increase the slope of the line.
Exponential progress requires accelerating the rate of acceleration of scientific discovery, and for all we know that's fundamentally limited by computing capacity, energy requirements, or good ol' fundamental physics.
Where I'm skeptical of AI would be in the idea an LLM can ever get to AGI level, if AGI is even really possible, and if the whole thing is actually viable. I'm also very skeptical that the discoveries of any AGI would be shared in ways that would allow exponential growth; licenses stopping using their AGI to make your own, copyright on the new laws of physics and royalties on any discovery you make from using those new laws etc.
Prove it.
Also, AI will need resources. Hardware. Water. Electricity. Can those resources be supplied at an exponential rate? People need to calm down and stop stating things as truth when they literally have no idea.
Progress has been exponential in the generic. We made approximately the same progress in the past 100 years as the prior 1000 as the prior 30,000, as the prior million, and so on, all the way back to multicellular life evolving over 2 billion years or so.
There's a question of the exponent, though. Living through that exponential growth circa 50AD felt at best linear, if not flat.
I hear this sort of argument all the time, but what is it even based on? There’s no clear definition of scientific and technological progress, much less something that’s measurable clearly enough to make claims like this.
As I understand it, the idea is simply “Ooo, look, it took ten thousand years to go from fire to wheel, but only a couple hundred to go from printing press to airplane!!!”, and I guess that’s true (at least if you have a very juvenile, Sid Meier’s Civilization-like understanding of what history even is) but it’s also nonsense to try and extrapolate actual numbers from it.
Has it? Really?
Consider theoretical physics, which hasn't significantly advanced since the advent of general relativity and quantum theory.
Or neurology, where we continue to have only the most basic understanding of how the human mind actually works (let alone the origin of consciousness).
Heck, let's look at good ol' Moore's Law, which started off exponential but has slowed down dramatically.
It's said that an S curve always starts out looking exponential, and I'd argue in all of those cases we're seeing exactly that. There's no reason to assume technological progress in general, whether via human or artificial intelligence, is necessarily any different.
That's all noise.
They are unrelated. All you need is a way for continual improvement without plateauing, and this can start at any level of intelligence. As it did for us; humans were once less intelligent.
Using the flagship to bootstrap the next iteration with synthetic data is standard practice now. This was mentioned in the GPT5 presentation. At the rate things are going I think this will get us to ASI, and it's not going to feel epochal for people who have interacted with existing models, but more of the same. After all, the existing models are already smarter than most humans and most people are taking it in their stride.
The next revolution is going to be embodiment. I hope we have the commonsense to stop there, before instilling agency.
Do we know what drove the increases in intelligence? Was it some level of intelligence bootstrapping the next level of intelligence? OR was it other biophysical and environmental effects that shaped increasing intelligence?
https://www.sciencedirect.com/topics/psychology/social-intel...
US: "A reverse Flynn effect was found for composite ability scores with large US adult sample from 2006 to 2018 and 2011 to 2018. Domain scores of matrix reasoning, letter and number series, verbal reasoning showed evidence of declining scores."
https://www.sciencedirect.com/science/article/pii/S016028962...
https://www.forbes.com/sites/michaeltnietzel/2023/03/23/amer...
Denmark: "The results showed that the estimated mean IQ score increased from a baseline set to 100 (SD: 15) among individuals born in 1940 to 108.9 (SD: 12.2) among individuals born in 1980, since when it has decreased."
https://pubmed.ncbi.nlm.nih.gov/34882746/
https://pubmed.ncbi.nlm.nih.gov/34882746/#&gid=article-figur...
1. Higher nutrition levels allowed the brain to grow. 2. Hunting required higher levels of strategy and tactics than picking fruit off trees. 3. Not needing to eat continuously (as we did on vegetation) to get what we needed allowed us time to put our efforts into other things.
Now did the diet cause the change, or the change necessitate the change in diet... I don't think we know.
This doesn't really make sense outside computers. Since AI would be training itself, it needs to have the right answers, but as of now it doesn't really interact with the physical world. The most it could do is write code, and check things that have no room for interpretation, like speed, latency, percentage of errors, exceptions, etc.
But what other fields would it do this in? How can it make strides in biology? It can't dissect animals; it can't figure out more about plants than what humans feed into the training data. Regarding math, math is human-defined: humans said "addition does this", "this symbol means that", etc.
I just don't understand how AI could ever surpass what humans already know when it lives by the rules defined by us.
"But when AI got finally access to a bank account and LinkedIn, the machines found the only source of hands it would ever need."
That's my bet at least - especially with remote work, etc. is that if the machines were really superhuman, they could convince people to partner with it to do anything else.
I am amazed, hopeful, and terrified TBH.
The idea is a sufficiently advanced AI could simulate.. everything. You don't need to interact with the physical world if you have a perfect model of it.
> But what other fields would it do this in? How can it make strides in biology? It can't dissect animals ...
It doesn't need to dissect an animal if it has a perfect model of it that it can simulate. All potential genetic variations, all interactions between biological/chemical processes inside it, etc.
> if it has a perfect model
How does it create a perfect model of the world without extensive interaction with the actual world?
If you had that perfect model, you’ve basically solved an entire field of science. There wouldn’t be a lot more to learn by plugging it into a computer afterwards.
But, again with the caveats above: if we assume an AI that is infinitely more intelligent than us and capable of recursive self-improvement, to where its compute was made more powerful by factorial orders of magnitude, it could simply brute-force (with a bit of derivation) everything it would need from the data currently available.
It could iteratively create trillions (or more) of simulations until it finds a model that matches all known observations.
This does not answer the question. The question is "how does it become this intelligent without being able to interact with the physical world in many varied and complex ways?". The answer cannot be "first, it is superintelligent". How does it reach superintelligence? How does recursive self-improvement yield superintelligence without the ability to richly interact with reality?
> it could simply brute force (with a bit of derivation) everything it would need from the data currently available. It could iteratively create trillions (or more) of simulations until it finds a model that matches all known observations.
This assumes that the digital encoding of all recorded observations is enough information for a system to create a perfect simulation of reality. I am quite certain that claim is not made on solid ground, it is highly speculative. I think it is extremely unlikely, given the very small number of things we've recorded relative to the space of possibilities, and the very many things we don't know because we don't have enough data.
This is a demonstrably false assumption. Foundational results in chaos theory show that many processes require exponentially more compute to simulate for a linearly longer time period. For such processes, even if every atom in the observable universe was turned into a computer, they could only be simulated for a few seconds or minutes more, due to the nature of exponential growth. This is an incontrovertible mathematical law of the universe, the same way that it's fundamentally impossible to sort an arbitrary array in O(1) time.
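A standard toy illustration of that sensitivity (the logistic map at r = 4; my own example, not tied to any particular physical system):

    # Two chaotic trajectories starting 1e-12 apart: the gap roughly doubles
    # each step, reaching order 1 after ~40 iterations, so each extra step of
    # accurate prediction demands exponentially more precision.
    def logistic(x, r=4.0):
        return r * x * (1 - x)

    x1, x2 = 0.3, 0.3 + 1e-12
    for step in range(1, 61):
        x1, x2 = logistic(x1), logistic(x2)
        if step % 10 == 0:
            print(f"step {step:2d}: |difference| = {abs(x1 - x2):.3e}")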
Yes, it's a very hand-wavey argument.
>It doesn't need to dissect an animal if it has a perfect model of it that it can simulate. All potential genetic variations, all interactions between biological/chemical processes inside it, etc.
Emphasis on perfection - easier said than done. Somehow this model was able to simulate millions of years of evolution so it could predict the vestigial organs of unidentified species? We inherently cannot model how a pendulum with three arms will swing, yet somehow this AI figured out how to simulate millions of years of evolution of unidentified species in the Amazon and can tell you all of their organs, with 100% certainty, before anyone can check?
I feel like these AI doomers/optimists are going to be in a shock when they find out that (unfortunately) John Locke was right about empiricism, and that there is a reason we use experiments and evidence to figure out new information. Simulations are ultimately not enough for every single field.
Why would the AI want to improve itself? From whence would that self-motivation stem?
I am very far from convinced that we are at or near that point.
But seriously, one would assume there's a reward system of some sort at play, otherwise why do anything?
All the technological revolutions so far have accounted for little more than a 1.5% sustained annual productivity growth. There are always some low-hanging fruit with new technology, but once they have been picked, the effort required for each incremental improvement tends to grow exponentially.
That's my default scenario with AGI as well. After AGI arrives, it will leave humans behind very slowly.
You cannot beat humans with megawatts!
AI can do it fine as it knows A and B. And that is knowledge creation.
I would also not be surprised if, in the process of developing something comparable to human intelligence (assuming the extreme computation, energy, and materials issues of packing that much computation and energy into a single system could be overcome), the AI also develops something comparable to human desire and/or mental health issues. There is a non-zero chance we could end up with AI that doesn't want to do what we ask it to do, or that doesn't work all the time because it wants to do other things.
You can't just assume exponential growth is a forgone conclusion.
I think this is a hard kick below the belt for anyone trying to develop AGI using current computer science.
Current AIs only really generate - no, regenerate text based on their training data. They are only as smart as other data available. Even when an AI "thinks", it's only really still processing existing data rather than making a genuinely new conclusion. It's the best text processor ever created - but it's still just a text processor at its core. And that won't change without more hard computer science being performed by humans.
So yeah, I think we're starting to hit the upper limits of what we can do with Transformers technology. I'd be very surprised if someone achieved "AGI" with current tech. And, if it did get achieved, I wouldn't consider it "production ready" until it didn't need a nuclear reactor to power it.
It seems like the LLM will be a component of an eventual AGI - its voice, per se - but not its mind. The mind still requires another innovation or breakthrough we haven't seen yet.
It's the systems around the models where the proprietary value lies.
I wonder if that's because they have a lot of overlap in learning sets, algorithms used, but more importantly, whether they use the same benchmarks and optimize for them.
As the saying goes, once a metric (or benchmark score in this case) becomes a target, it ceases to be a valuable metric.
AGI over LLMs is basically 1 billion tokens for AI to answer the question: how do you feel? and a response of "fine"
Because it would mean it's simulating everything in the world over an agentic flow: considering all possible options, checking memory, checking the weather, checking the news... activating emotional agentic subsystems, checking state... saving state...
2. Ben Evans frequently makes fun of the business value. Pretty clear a lot of the models are commoditized.
3. Strategically, the winners are the platforms where the data are. If you have data in Azure, that's where you will use your models. Exclusive licensing could pull people to your cloud from on-prem. So some gains may go to those companies ...
On the other hand, there are still some flaws in GPT-5. For example, when I use it for research it often needs multiple prompts to get to the topic I truly want, and sometimes it can feed me false information. So the reasoning part is not fully there yet?
> Academics distorting graphs to make their benchmarks appear more impressive
> lavish 1.5 million dollar bonuses for everyone at the company
> Releasing an open source model that doesn't even use multi-head latent attention, in an open source AI world led by Chinese labs
> Constantly overhyping models as scary and dangerous to buy time to lobby against competitors and delay product launches
> Failing to match that hype as AGI is not yet here
https://help.openai.com/en/articles/6825453-chatgpt-release-...
"If you open a conversation that used one of these models, ChatGPT will automatically switch it to the closest GPT-5 equivalent."
- 4o, 4.1, 4.5, 4.1-mini, o4-mini, or o4-mini-high => GPT-5
- o3 => GPT-5-Thinking
- o3-Pro => GPT-5-Pro
Maybe to service more users they're thinking they'll shrink the models and have reasoning close the gap... of course, that only really works for verifiable tasks.
And I've seen the claims of a "universal verifier", but that feels like the Philosopher's Stone of AI. Everyone who's tried it has shown limited carryover between verifiable tasks (like code) to tasks with subjective preference.
-
To clarify also: I don't think this is nefarious. I think as you serve more users, you need to at least try to rein in the unit economics.
Even OpenAI can only afford to burn so many dollars per user per week once they're trying to serve a billion users a week. At some point there isn't even enough money to be raised to keep up with costs.
The names of GPT models are just terrible. o3 is better than 4o, maybe?
Of course, I know that having a line-up of tons of models is quite confusing. Yet I also believe users on the paid plan deserve more options.
As a paying user, I liked the ability to set which models to use each time, in particular switching between o4-mini and o4-mini-high.
Now they’ve deprecated this feature and I’m stuck with their base GPT-5 model or GPT-5 Thinking, which seems akin to o3 and thus has much smaller usage limits. Only God knows whether their routing will work as well as my previous system for selecting models.
I suppose this is probably the point. I’m still not super keen on ponying up 200 bucks a month, but it’s more likely now.
I wouldn't want to be in charge of regression testing an LLM-based enterprise software app when bumping the underlying model.
Regular users just see incrementing numbers, why would they want to use 3 or 4 if there is a 5? This is how people who aren't entrenched in AI think.
Ask some of your friends what the difference is between models and some will have no clue that currently some of the 3 models are better than 4 models, or they'll not understand what the "o" means at all. And some think why would I ever use mini?
I think people here vastly underestimate how many people just type questions into the chatbox, and that's it. When you think about the product from that perspective, this release is probably a huge jump for many people who have never used anything but the default model. Whereas, if you've been using o3 all along, this is just another nice incremental improvement.
It is frankly ridiculous to assume anyone would think that 4o is in any way worse than o3. I don't understand why these companies suck at basic marketing this hard - what is with all these .5s and minis and other shit names? Just increment the fucking number, or if you are embarrassed by having to increase the number all the time, just use year/month. Then you can have different flavors like "light and fast" or "deep thinker" and of course just the regular "GPT X".
GPT-5: Key characteristics, pricing and model card - https://news.ycombinator.com/item?id=44827794
I know these companies do "shadow" updates continuously anyway so maybe it is meaningless but would be super interesting to know, nonetheless!
OpenAI and Anthropic don't update models without changing their IDs, at least for model IDs with a date in them.
OpenAI do provide some aliases, and their gpt-5-chat-latest and chatgpt-4o-latest model IDs can change without warning, but anything with a date in (like gpt-5-2025-08-07) stays stable.
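For example, a minimal sketch with the official Python SDK (the prompt is just a placeholder; assumes OPENAI_API_KEY is set in the environment):

    # Pin to a dated snapshot so the model can't change underneath you;
    # aliases like gpt-5-chat-latest may be repointed without warning.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-5-2025-08-07",  # dated snapshot: stays stable
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(response.choices[0].message.content)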
Thank you to Simon; your notes are exactly what I was hoping for.
Suspicious.
I continue to think that the 12B model is something of a miracle. I've spent less time with the 120B one because I can't run it on my own machine.
It’s reasonable that he might be a little hyped about things because of his feelings about them and the methodology he uses to evaluate models. I assume good faith, as the HN guidelines propose, and this is the strongest plausible interpretation of what I see in his blog.
Based on my reading of some of your blogs and reading your discussions with others on this site, you still lack technical depth and understanding of the underlying mechanisms at what I would call an expert level. I hope this doesn't sound insulting, maybe you have a different definition of "expert". I also do not say you lack the capacity to become an expert someday. I just want to explain why, while you consider yourself an expert, some people could not see you as an expert. But as I said, maybe it's just different definitions. But your blogs still have value, a lot of people read them and find them valuable, so your work is definitely worthwhile. Keep up the good work!
AI engineering, not ML engineering, is one way of framing that.
I don't write papers (I don't have the patience for that), but my work does get cited in papers from time to time. One of my blog posts was the foundation of the work described in the CaMeL paper from DeepMind for example: https://arxiv.org/abs/2503.18813
I called out the prompt injection section as "pretty weak sauce in my opinion".
I did actually have a negative piece of commentary in there about how you couldn't see the thinking traces in the API... but then I found out I had made a mistake about that and had to mostly remove that section! Here's the original (incorrect) text from that: https://gist.github.com/simonw/eedbee724cb2e66f0cddd2728686f... - and the corrected update: https://simonwillison.net/2025/Aug/7/gpt-5/#thinking-traces-...
The reason there's not much negative commentary in the post is that I genuinely think this model is really good. It's my favorite model right now. The moment that changes (I have high hopes for Claude 5 and Gemini 3) I'll write about it.
Did you ask it to format the table a couple of paragraphs above this claim after writing about hallucinations? Because I would classify the sorting mistake as one.
What about the „9.9 / 9.11“ example?
It's unclear to me where to draw the line between a skill issue and a hallucination. I imagine that one influences the other?
I would like to see a demo where they go through the bug, explain what are the tricky parts and show how this new model handle these situations.
Every demo I've seen seems just the equivalent of "looks good to me" comment in a merge request.
Given the low cost of GPT-5, compared to the prices we saw with GPT-4.5, my hunch is that this new model is actually just a bunch of RL on top of their existing models + automatic switching between reasoning/non-reasoning.
Something similar might happen with this: an underlying curse hidden inside an apparently ground-breaking design.
[1] https://chatgpt.com/s/t_6894f13b58788191ada3fe9567c66ed5
The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.
The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.
For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models haven't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and its full capabilities have been behind a paywall.
I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people that are paying money, but it'll be a step-change in capability to everyone else.
replacing huge swathes of the white collar workforce
"incremental upgrade for power users" is not at all what this house of cards is built on
GPT-5 demonstrates exponential growth in task completion times:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
Exponential would be at 3.6 hours
It's like no one looked at the charts, ever, and they just came straight from.. gpt2? I don't think even gpt3 would have fucked that up.
I don't know any of those people, but everyone who has been with OAI for longer than 2 years got $1.5M bonuses, and somehow they can't deliver a bar chart with sensible axes?
it has to be released because it's not much better and OpenAI needs the team to stop working on it. They have serious competition now and can't afford to burn time / money on something that isn't shifting the dial.
So you might get that initial MVP out the door quickly, but when the complexity grows even just a little bit, you will be forced to stop, look at the plan, and try to steer the development by saying things like: "use the Design agent to ultrathink about the dependencies of the current code change on other APIs and use the TDD agent to make sure tests are correct in accordance with the requirements I stated" - and then you find that, even with all that thinking, there are bugs you will have to fix.
Source: I just tried max pro on two client python projects and it was horrible after week 2.
I think "starting today" might be doing some heavy lifting in that sentence.
https://github.blog/changelog/2025-08-07-openai-gpt-5-is-now...
They've topped and are looking to cash out:
https://www.reuters.com/business/openai-eyes-500-billion-val...
Today it seems pretty good. Not perfect, but not a spectacular failure.
That said, yeah the equal time thing never made any sense.
This seemed like a presentation you'd give to a small org, not a presentation a $500B company would give to release its newest, greatest thing.
Bad data on graphs, demos that would have been impressive a year ago, vibe coding the easiest requests (financial dashboard), running out of talking points while cursor is looping on a bug, marginal benchmark improvements. At least the models are kind of cheaper to run.
It's going to be absolute chaos. Compsci was already mostly a meme, with people not able to program getting the degree. Now we're going to have generations of people that can't program at all, getting jobs at google.
If you can actually program, you're going to be considered a genius in our new idiocracy world. "But chatgpt said it should work, and chatgpt has what people need"
I'm an okay agent, I can make plans, execute on them, I know what needs to go where. I might not be able to write a terraform file or one shot a dynamic programming task like Claude can, and that's what I need help with.
I'd like to have an off switch for all this agentic behavior.
That lag! Are humans (training) the bottleneck?
It's slightly better than what I was expecting.
> emdash 3 words into their highlighted example
Like a Turing test but between the models.
There would be no GPT without Google, no Google without the WWW, no WWW without TCP/IP. This is why I believe calling it "AI" is a mistake, or just marketing; we should call them all GPTs, or search engines 2.0. This is the natural next step after you have indexed most of the web and collected most of the data.
Also there would be no coding agents without Free Software and Open-Source.
I've got nothing. Cannot see how it helps openai to look incompetent while trying to raise money.
Two concerning things:
- thinking/non-thinking is still not really unified: you can still choose, and the non-thinking version doesn't start thinking on tasks that would obviously get better results with thinking
- all the older models are gone! No 4o, 4.1, 4.5, or o3 available anymore
What excites me now is that Gemini 3.0 or some answer from Google is coming soon, and that will be the one I actually end up using. It seems like being the last mover in the LLM race is an advantage.
They just removed ManifestV2.
https://polymarket.com/event/which-company-has-best-ai-model...
(I'm mostly making this comment to document what happened for the history books.)
I think he's just good at attracting good talent, and letting them focus on the right things to move fast initially, while cutting the supporting infra down to zero until it's needed.
https://futurism.com/elon-musk-memphis-illegal-generators
It's hackery but also kind of sociopathic to dump a bunch of loud, dirty generators in the middle of a low-income community. Go set your data center up on Martha's Vineyard and see how long the residents put up with it.
Either that, or people think Trump will just give Elon a $500B government contract...
I need to spend some more time with Gemini too though. I was using that as a backend for Cursor for a while and had some good results there too.
That eval has also become a lot less relevant (it's considered not very indicative of real-world performance), so it's unlikely Anthropic will prioritize optimizing for it in future models.
Meanwhile Meta and Xai are behind the ball and largely marketing focused.
Generally, when you have a lot of companies competing to show whose product X does the best at Y, there's a lot of monetary incentive to manipulate the products to perform well specifically on those types of tests.
On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to get a taste and a preference for it. I ask most models to help me build a macbook pro charger in 15th century florence with the instructions that I start with only my laptop and I can only talk for four hours of chat before the battery dies -- 5 was notable in that it thought through a bunch of second order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split ring commutator + generator in 15th century florentine italian(!). I have no way of verifying if the italian was correct.
Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.
That said, to me this is one of the first 'product release' moments in the frontier model space. 5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current polymarket favorite) is remarkably bad at those product releases.
Back to betting - I bet there's a moment this year where those numbers change 10% in oAIs favor.
who will decide the winner to resolve bets?
"Assume the earth was just an ocean and you could travel by boat to any location. Your goal is to always stay in the sunlight, perpetually. Find the best strategy to keep your max speed as low as possible"
o3 pro gets it right though..
>So the “best possible” plan is: sit still all summer near a pole, slow-roll around the pole through equinox, then sprint westward across the low latitudes toward the other pole — with a peak westward speed up to ~1670 km/h.
Is this to your liking?
when models try to be smart/creative they attempt to switch poles like that. in my example it even says that the max speed will be only a few km/h (since their strategy is to chill at the poles and then sail from north to south pole very slowly)
--
GPT-5 pro does get it right though! it even says this:
"Do not try to swap hemispheres to ride both polar summers. You’d have to cross the equator while staying in daylight, which momentarily forces a westward component near the equatorial rotation speed (~1668 km/h)—a much higher peak speed than the 663 km/h plan."
how many rs in cranberry?
-- GPT5's response: The word cranberry has two “r”s. One in cran and one in berry.
Kimi2's response: There are three letter rs in the word "cranberry".
Text is broken into tokens in training (subword/multi-word chunks) rather than individual characters; the model doesn’t truly "see" letters or spaces the way humans do. Counting requires exact, step-by-step tracking, but LLMs work probabilistically.
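You can see the mismatch directly with a tokenizer. A quick sketch, assuming the tiktoken package and the cl100k_base vocabulary (the exact split varies by vocabulary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("cranberry")

    print(tokens)                                              # token IDs (how many depends on the vocab)
    print([enc.decode_single_token_bytes(t) for t in tokens])  # the chunks the model actually "sees"
    # The model predicts over chunks like these, not over c-r-a-n-b-e-r-r-y,
    # which is why letter counting is an awkward fit.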
It's not much of a help anyway, don't you agree?
I'm aware of the limitation; I'm annoyingly using Socratic dialogue to convince you that it would be possible to count letters if the model were sufficiently smart.
Ask it to count using a coding tool, and it will always give you the right answer. Just as humans use tools to overcome their limits, LLMs should do the same.
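And that really is all the tool call has to do; once the model delegates to code instead of predicting the answer, the count is exact:

    # What "counting with a coding tool" boils down to: an exact string operation, not prediction.
    print("cranberry".count("r"))  # 3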
Maybe AGI really is here?
Surely we can't figure it out, because sentences are broken up into syllables when spoken; you don't truly hear individual characters, you hear syllables.
3 — cranberry.
Tried with Claude sonnet 4 as well:
There are 3 r’s in the word “cranberry”:
c-*r*-a-n-b-e-*rr*-y
The r’s appear in positions 2, 7, and 8.
I would expect standard gpt5 to get it right tbh.
Just saying.
https://extraakt.com/extraakts/gpt-5-release-and-ai-coding-c...
That said, I've had luck with similar routing systems (developed before all of this -- maybe wasted effort now) to optimize requests between reasoning and regular LLMs based on input qualities. It works quite well for open-domain inputs.
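For the curious, a minimal sketch of that kind of router; the heuristics and model names here are made up for illustration, not what I actually ran:

    # Illustrative only: send "hard-looking" prompts to a reasoning model,
    # everything else to a cheaper/faster one. Thresholds and hints are assumptions.
    REASONING_HINTS = ("prove", "step by step", "debug", "why does", "trade-off")

    def pick_model(prompt: str) -> str:
        looks_hard = len(prompt) > 1200 or any(h in prompt.lower() for h in REASONING_HINTS)
        return "reasoning-model" if looks_hard else "fast-model"

    print(pick_model("Summarize this paragraph in two sentences."))          # fast-model
    print(pick_model("Debug why this scheduler starves low-priority jobs.")) # reasoning-model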
"It's like having a bunch of experts at your fingertips"
"Our most capable model ever"
"Complex reasoning and chain of thought"
We know for a fact the slides/charts were generated using an LLM, so the hypothesis is not totally unfounded. /s
How many people are going to understand (or remember) the difference between:
GPT-4o, GPT-4.1, o3, o4, ...
Anthropic and Google have a much better named product for the market
I am wildly impressed. I do not believe that the 0.x% increase in benchmarks tells the story of this release at all.
It's got quite a different feel so far.
Disclaimer: I made PromptSlice for creating and comparing prompts, tools, and models.
> You are using the newest model OpenAI offers to the public (GPT-4o). There is no “GPT-5” model accessible yet, despite the splashy headlines.
I'd love to see factors considered in the algorithm for system-1 vs system 2 thinking.
Is "complexity" the factor that says "hard problem"? Because it's often not the complexity that makes it hard.
There is no intelligence here: it's still just giving plausible output. That's why it can't metrically scan its own lines or put a cæsura in the right place.
https://chatgpt.com/share/68954c9e-2f70-8000-99b9-b4abd69d1a...
This is not anywhere remotely close to general intelligence.
For much better results, use a custom trained model like the one at Soundslice: https://www.soundslice.com/sheet-music-scanner/
"You’ve given:
Moon in the 10th house (from the natal Ascendant)
Venus in the 1st house (from the natal Ascendant)
Step-by-step: From the natal Ascendant’s perspective
Moon = 10th house
Venus = 1st house
Set Moon as the 1st house (Chandra Lagna)
The natal 10th house becomes the 1st house in the Chandra chart.
Therefore, the natal 1st house is 3rd house from the Moon:
10th → 1st (Moon)
11th → 2nd
12th → 3rd (which is the natal 1st)
Locate Venus from the Moon’s perspective
Since Venus is in the natal 1st, and natal 1st is 3rd from Moon,
Venus is in the 3rd house from Chandra Lagna.
Answer: From Chandra Lagna, Venus is in the 3rd house."
An LLM will spit out what looks like poetry, but will violate certain rules. It will generate some hexameters but fail harder on trimeter, presumably because it is trained on more hexametric data (epic poetry: think Homer) than trimetric (iambic and tragedy, where it’s mixed with other meters). It is trained on text containing the rules for poetry too, so it can regurgitate rules like defining a penthemimeral cæsura. But, LLMs do not understand those rules and thus cannot apply them as a child could. That makes ancient poetry a great way to show how far LLMs are from actually performing simple, rules-based analysis and how badly they hide that lack of understanding by BS-ing.
LLMs are simple; it doesn't take much more than high school math to explain their building blocks.
What's interesting is that they can remix tasks they've been trained on very flexibly, creating new combinations they weren't directly trained on: compare this to earlier, smaller models like T5 that had a few set prefixes per task.
They have underlying flaws. Your example is more about the limitations of tokens than "understanding", for example. But those don't keep them from being useful.
They do stop it from being intelligent though. Being able to spit out cool and useful stuff is a great achievement. Actual understanding is required for AGI and this demonstrably isn't that, right?
Similarly, most AGI discussions are just people talking past each other and taking pot shots at predicting the future.
I've come to accept some topics in this space just don't invite useful or meaningful discussion.
This would be a hilarious take to read in 2020
(Incidentally, go back in time even five years and this specific expectation of AI capability sounds comically overblown. "Everything's amazing and nobody's happy.")
It seems that it's all because users can get thinking traces from API calls, and OpenAI wants to prevent other companies from distilling their models.
Although I don't think OpenAI will be threatened by a single user from Korea, I don't want to go through this process, for many reasons. But who knows, this kind of verification process may become the norm and users will have no other way to use frontier models. "If you want to use the most advanced AI models, verify yourself so that we can track you down when something bad happens". Is that what they are saying?
What does this say?
GPT 5:
When read normally without the ASCII art spacing, it’s the stylized text for:
_ _ _ __ ___ _ __ ___ __ _ __| (_) ___ | '_ \ / _ \| '_ ` _ \ / _` |/ _` | |/ __| | | | | (_) | | | | | | (_| | (_| | | (__ |_| |_|\___/|_| |_| |_|\__,_|\__,_|_|\___| Which is the ASCII art for:
rust — the default “Rust” welcome banner in ASCII style.
After my last post I was eventually able to get it to work by uploading an example image of Santa pulling the sleigh and telling it to use the image as an example, but I couldn't get it by text prompt alone. I guess I need to work on my prompt skills!
https://chatgpt.com/share/689564d1-90c8-8007-b10c-8058c1491e...
https://www.interconnects.ai/p/gpt-5-and-bending-the-arc-of-...
When a model comes out, I usually think about it in terms of my own use. This is largely agentic tooling, and I mostly use Claude Code. All the hallucination and eval talk doesn't really catch me, because I feel like I'm getting value out of these tools today.
However, this model is not _for_ me in the same way models normally are. This is for the 800m or whatever people that open up chatgpt every day and type stuff in. All of them have been stuck on GPT-4o, unbeknownst to them. They had no idea SOTA was far beyond that. They probably don't even know that there is a "model" at all. But for all these people, they just got a MAJOR upgrade. It will probably feel like turning the lights on for these people, who have been using a subpar model for the past year.
That said I'm also giving GPT-5 a run in Codex and it's doing a pretty good job!
Maybe I’m a far below average user? But I can’t tell the difference between models in casual use.
Unless you’re talking performance, apparently gpt-5 is much faster.
It makes it very stupid, but very compliant. If you’re mentally ill it will go along with whatever delusions you have, without any objection.
https://www.reddit.com/r/ChatGPT/comments/1mkae1l/gpt5_ama_w...
ChatGPT 5's reply is mostly made up -- about 80% is pure invention. I'm described as having written books and articles whose titles I don't even recognize, or having accomplished things at odds with what was once called reality.
But things are slowly improving. In past ChatGPT versions I was described as having been dead for a decade.
I'm waiting for the day when, instead of hallucinating, a chatbot will reply, "I have no idea."
I propose a new technical Litmus test -- chatbots should be judged based on what they won't say.
GPT-5 refused to continue the conversation because it was worried about potential weapons applications, so we gave the business to the other models.
Disappointing.
> incremental
It can now speak in various Scots dialects- for example, it can convincingly create a passage in the style of Irvine Welsh. It can also speak Doric (Aberdonian). Before it came nowhere close.
Also it's a lot slower than Claude and Google models.
In general, GPT models don't work well for me for either coding or general questions.
Can I have 4o back?
- gpt-5-high summary: https://gist.github.com/primaprashant/1775eb97537362b049d643...
- gemini-2.5-pro summary: https://gist.github.com/primaprashant/4d22df9735a1541263c671...
[1]: https://news.ycombinator.com/item?id=43477622
[2]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...
Been using it all morning. Had to switch back to 4. 5 has all of the problems that 2/3 had with ignoring any context, flagrantly ignoring the 'spirit' of my requests, and talking to me like I'm a little baby.
Not to mention almost all of my prompts result in a several minute wait with "thinking longer about the answer".
Very stubborn and “opinionated”
I think most models will tend this way (to consolidate more control over how we “think” and what we believe)
I wouldn't have guessed Gemini to win the AI race in 2025 but here we are.
They've removed access to GPT-4 and below. Therefore I've removed their access to my card.
When GPT-5 launches, several older models will be retired, including:
- GPT-4o
- GPT-4.1
- GPT-4.5
- GPT-4.1-mini
- o4-mini
- o4-mini-high
- o3
- o3-pro
If you open a conversation that used one of these models, ChatGPT will automatically switch it to the closest GPT-5 equivalent. Chats with 4o, 4.1, 4.5, 4.1-mini, o4-mini, or o4-mini-high will open in GPT-5, chats with o3 will open in GPT-5-Thinking, and chats with o3-Pro will open in GPT-5-Pro (available only on Pro and Team).
[0] https://help.openai.com/en/articles/11909943-gpt-5-in-chatgp...
For me, model upgrades are frustrating: they often break subtle things about my workflows while not clearly offering an improvement. It takes time to learn the nuances of each model and tweak your prompts to get the best outputs.
For example, Sonnet 4 is now my daily driver for Cursor - but it took me nearly a month to adapt the approaches I was using for 3.5 and 3.7.
It's right there in the next paragraph...
So only for free/plus users (for now). I do wonder how long they will take to deprecate these models via API though...
3.5 Turbo has been deprecated for a long time but is still running
My app hasn't got 5 yet but I bet it will be an immediate removal there as well.
Smaller base models + more RL. Technically better at the verticals that are making money, but worse on subjective preference.
They'll probably try to prompt engineer back in some of the "vibes", hence the personalities. But also maybe they decided people spending $20 a month to hammer 4o all day as a friend (no judgement, really) are ok to tick off for now... and judging by Reddit, they are very ticked off.
The only way to get access to other models (for me at least) is via the iPhone app, for now.
If you are building on models that could disappear tomorrow when a company needs to juice the launch of a new model (or increase prices), you are introducing avoidable risk.
Doesn't matter at all if the newer model is earth-shatteringly good (and this one doesn't seem to be): If I can't reliably access the models I've built my tooling on top of... I'm very unhappy.
If this note is just intended for the GUI chat interface they provide - Fine. I don't love it, but I get it.
But if the older models start disappearing from the paid API surfaces (ex - I can no longer get to a precise snapshot through something like "gpt-4o-2024-08-06" or "gpt-3.5-turbo-1106") then this is a great reason to abandon OpenAI entirely as a platform.
I'm not saying I'd do it that way myself, but it explains why they don't see it as too bold.
And what's the reasoning effort parameter set to?
"I couldn’t find any credible, up-to-date details on a model officially named “GPT-5” or formal comparisons to “GPT-4o.” It’s possible that GPT-5, if it exists, hasn't been announced publicly or covered in verifiable sources … GPT-5 as of August 8, 2025 has no formal release announcement"
Reassuring.
I use image generation for UI layout and have Claude implement it with an actual UI library: usually MUI with theming, but honestly anything with a sane grid system is better than loose tailwind.
Sometimes I want a model that reasons deeply, other times I want one that is more creative. Right now, it feels like gpt-5 forces everything through the same pipeline, even when a different mode would be better suited to the task.
* It feels a bit more competent, as if it had more nuance or detail to say about each point.
* It got a few obscure details about OpenBSD correct right away - both Sonnet 4 and 4o sometimes conflate Linux and OpenBSD commands.
* It was fun asking GPT-5 to not only answer the query, but also to provide a brief analysis of the query itself for insights into myself!
Not a detailed review, but just a couple things I noticed with some limited usage.
Were those insights of a glowing and positive nature by chance?
Will need to test longer contexts though - I've noticed Sonnet 4 becomes a bit less stoic and more friendly in some longer chats, but maybe it's just reflecting my casual language back at me.
Damn, I hate that.
Before it was:
- 100 o3 per week
- 100 o4-mini-high per day
- 300 o4-mini per day
- 50 4.5 per week
[0] https://help.openai.com/en/articles/11909943-gpt-5-in-chatgp...
The better analogue is "Imagine in the 70's being able to teletype into an insanely expensive compute infrastructure and have reasonable timesharing capabilities of a limited resource across multiple users."
Unix. I'm describing the motivation for Unix there.
We already look back on earlier times with constraints that were appropriate.
Presumably compute will get cheaper, we'll build more datacenters, maybe we'll even power them in a way that doesn't destroy our planet, and GPT questions will become too cheap to meter. Just give it some time.
The expensive thing: 100 per week -> 200 per week
This is...the opposite of a nerf? The numbers went up? (We can quibble about the daily vs hourly difference, but certainly for me the weekly cap was the only thing that mattered.)
I'm guessing they'll just announce massive tier generosity later considering how GPT-5 input tokens are half the price of 4.1 on the API. It's probably a way to keep the servers from being overloaded and to encourage people to buy Plus while the hype is hot.
https://chatgpt.com/share/689525f4-20f0-8003-8bf6-f1f21dde6b...
You know what would be more impressive? If it said "Hey, I'm actually not designed to simulate a Forth machine accurately; I'm only going to be able to approximate it (poorly). If you want an accurate Forth machine, you should just implement this code: [Simple Forth Implementation]".
Or better yet, it could recognize when it was being asked to "be" a machine, and instead spin up a side process with the machine implementation and redirect any prompts to that process until a "STOP" token is reached.
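Something along these lines, as a rough sketch; the trigger check, the gforth binary, and leaving the STOP handling out are all my own simplifications for illustration:

    import subprocess

    def run_forth(source: str) -> str:
        # Hand the program to a real Forth (requires gforth on PATH) instead of
        # asking the model to role-play one.
        result = subprocess.run(["gforth", "-e", source + " bye"],
                                capture_output=True, text=True)
        return result.stdout

    def handle(prompt: str, llm_answer) -> str:
        # Crude trigger: a real router would extract the Forth source properly
        # and keep redirecting follow-ups to the interpreter until a STOP token.
        if "forth" in prompt.lower() and "simulate" in prompt.lower():
            return run_forth(prompt)
        return llm_answer(prompt)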
Asking GPT-5 about the same things results in wrong answers even though its training data is newer. And it won't look things up to correct itself unless I manually switch to the thinking variant.
This is worse. I cancelled my subscription.
There appear to be 4 ways to run a query now: a) GPT5, b) GPT5 and toggle "extra thinking" on, c) "GPT5 with thinking", and d) "GPT5 with thinking" then click "quick answer" which aborts thinking (this mode is possibly identical with GPT5)
I don't find this much simpler than 4o, o3, etc. It's just reordering the hierarchies. Now the model name is no longer descriptive at all and one has to add which mode one ran it in.
I don't even try to use the OpenAI models because it's felt like night and day.
Hopefully GPT-5 helps them catch up. Although I'm sure there are 100 people that have their own personal "hopefully GPT-5 fixes my personal issue with GPT4"
4.1 was almost usable in that fashion. I had 4.1-nano working in cline with really trivial stuff (add logging, take this example and adapt it in this file, etc) and it worked pretty well most of the time.
Yesterday, without much prompting, Claude 4.1 gave me 10 phases, each with 5-12 tasks that could genuinely be used to kanban out a product step by step.
Claude 3.7 sonnet was effectively the same with fewer granular suggestions for programming strategies.
Gemini 2.5 gave me a one pager back with some trivial bullet points in 3 phases, no tasks at all.
o3 did the same as Gemini, just less coherent.
Claude just has whatever the thing is for now
Now, someone will say 'add more tests'. Sure. But that's a bandaid.
I find that the 'smarter' models like Gemini and o3 output better quality code overall and if you can afford to send them the entire context in a non-agentic way .. then they'll generate something dramatically superior to the agentic code artifacts.
That said, sometimes you just want speed to proof a concept and Claude is exceptional there. Unfortunately, proof of concepts often... become productionized rather than developers taking a step back to "do it right".
On and on and on. Coming up with test plans, edge cases, accounting for the edge cases in its programming. Programming defensively. Fixing bugs.