One might characterize it as an improvement in the document-style which the model operates upon.
My favorite barely-a-metaphor is that the "AI" interaction is based on a hidden document that looks like a theater script, where characters User and Bot are having a discussion. Periodically, the make_document_longer(doc) function (the stateless LLM) is invoked to complete more Bot lines. An orchestration layer performs the Bot lines towards the (real) user, and transcribes the (real) user's submissions into User dialogue.
Recent improvements? Still a theater-script, but:
1. Reasoning - The Bot character is a film-noir detective with a constant internal commentary, not typically "spoken" to the User character and thus not "performed" by the orchestration layer: "The case was trouble, but I needed to make rent, and to do that I had to remember it was Georgia the state, not the country."
2. Tools - There are more stage-directions, such as "Bot uses [CALCULATOR] inputting [sqrt(5)*pi] and getting [PASTE_RESULT_HERE]". Regular programs are written to parse the script, run tools, and then replace the result (a toy sketch of that loop follows below).
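Here's a toy sketch of what that orchestration loop might look like; every name in it is invented for illustration rather than taken from any real system:

    import re

    def chat_turn(doc, make_document_longer, run_tool):
        """Toy sketch of the orchestration layer in this metaphor (all names invented)."""
        doc = make_document_longer(doc)  # the stateless LLM extends the hidden script
        # Stage directions like "Bot uses [CALCULATOR] inputting [sqrt(5)*pi] and
        # getting [PASTE_RESULT_HERE]" are found by an ordinary program, executed,
        # and the result is pasted back into the script.
        for expr in re.findall(r"\[CALCULATOR\] inputting \[(.+?)\]", doc):
            doc = doc.replace("[PASTE_RESULT_HERE]", str(run_tool(expr)), 1)
            doc = make_document_longer(doc)  # let the Bot character react to the result
        # Only the Bot's spoken dialogue is performed to the real user;
        # the film-noir internal commentary stays in the hidden script.
        spoken = doc.rsplit("Bot:", 1)[-1].strip()
        return doc, spoken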
Meanwhile, the fundamental architecture and the make_document_longer(doc) haven't changed as much, hence the author's title of "not model improvement."*
This was an unusual task Bot wasn't sure how to solve directly.
Bot decided it needed to execute a program:
[CODE_START]foo(bar(baz()))[CODE_END]
Which resulted in
[CODE_RESULT_PLACEHOLDER]
This stage-direction is externally parsed, executed, and substituted, and then the LLM is called upon to generate Bot-character's next reaction.
In terms of how this could go wrong, it makes me think of a meme:
> Thinking quickly, Dave constructs a homemade megaphone, using only some string, a squirrel, and a megaphone.
- finding patterns in data is memorization
- finding patterns in metadata is intelligence
- finding patterns in meta-metadata is invention
For example, if you ask someone to hang a painting in an art gallery 12 feet from the floor using a 13-foot ladder:
- a worker will use the safety rule of staying 5 feet away from the wall. This is what GPT-3 does. [1]
- an engineer will apply the Pythagorean theorem (see the quick check after this list). This is what o3 does.
- Pythagoras, seeing it for the first time, will derive the theorem. GPT-5 is nowhere close to that.
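(A quick check of the engineer's arithmetic, as a throwaway Python snippet:)

    import math

    ladder, height = 13, 12
    base = math.sqrt(ladder**2 - height**2)  # sqrt(169 - 144) = sqrt(25)
    print(base)  # 5.0 -- the theorem reproduces the worker's 5-foot rule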
This climbing up the ladder of abstraction existed even before LLMs. DeepMind's AlphaGo learned from human games. But AlphaGo Zero and AlphaZero trained entirely through self-play and began uncovering new strategies across Go, chess, and shogi. So whether it's code, a game, or pseudocode, they're all metadata operating at the same level of abstraction.
[1] The Nature of Intelligence is Meta - https://manidoraisamy.com/intelligence-is-meta.html
Though to that end, I wonder if the model "knows" that it "understands" the fundamentals better once it's been trained like this, or whether, when it has to do a large multiplication as part of a larger reasoning task, it still breaks it down step by step.
I'm not sure why we should be dissatisfied with that?
I don't think OpenAI launching ChatGPT Apps and Atlas signals they're pivoting.
It's just that when you raise that much money you must deploy it in any possible direction.
> Unlike GPT-3, which at least attempted arithmetic internally (and often failed), o1 explicitly delegates computation to external tools.
How is it a bad thing? Does the author really believe this is a bad thing?
Even if we believe the tech bros' wildest claim - that AGI is around the corner - I still don't know why calling external tools makes an AGI any less of an AGI.
If you ask Terence Tao what 113256289421x89831475287 is, I'm quite sure he'd "call external tools." Does that make him any less of a mathematician?
Plus, this is not what people call "reasoning." The title:
> Reasoning Is Not Model Improvement
The content:
> (opening with how o1 is calling external tools for arithmetic)
...anyway, whatever. I guess it's a Cunningham's Law thing. Otherwise it's a bit puzzling why someone who knows nothing about a topic had to write an article to make everyone know how clueless they are.
Reasoning is about working through problems step-by-step. This is always going to be necessary for some problems (logic solving, puzzles, etc) because they have a known minimum time complexity and fundamentally require many steps of computation.
Bigger models = more width to store more information. Reasoning models = more depth to apply more computation.
> When you ask o1 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result.
That's not true of the model itself, see my comment here which demonstrates it multiplying two large numbers via the OpenAI API without using Python: https://news.ycombinator.com/item?id=45683113#45686295
On GPT-5 it says:
> What they delivered barely moved the needle on code generation, the one capability that everything else depends on.
I don't think that holds up. GPT-5 is wildly better at coding than GPT-4o was (and got even better with GPT-5-Codex). A lot of people have been ditching Claude for GPT-5 for coding stuff, and Anthropic held the throne for "best coding model" for well over a year prior to that.
From the conclusion:
> All [AI coding startups] betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards.
The models really don't need to get better at generating code right now for the economic impact to be profound. If progress froze today we could still spend the next 12+ months finding new ways to get better results for code out of our current batch of models.
Even if GPT-5 were a less capable coder than Claude, I'd still not use Claude because of its ridiculous quotas, context window restrictions, slowness, and Anthropic's pedantic stance on AI safety.
[Edit] It's probably premature to argue without the above data, but if we assume tool use gives ~100% accuracy and reasoning-only ~90%, then that 10% gap might represent the loss in the probabilistic model: either from functional ambiguity in the model itself or symbolic ambiguity from tokenization?
My o1 call in https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e... used 16 input tokens and produced 2357 output tokens (1664 were reasoning). At o1's price that's 14 cents! https://www.llm-prices.com/#it=16&ot=2357&ic=15&cic=7.5&oc=6...
I can't call o1 with the Python tool via the API, so I'll have to provide the price for the GPT-5 example in https://gist.github.com/simonw/c53c373fab2596c20942cfbb235af... - that one was 777 input tokens and 140 output tokens. Why 777 input tokens? That's a bit of a mystery to me - my assumption is that a bunch of extra system prompt stuff gets stuffed on describing that coding tool.
GPT-5 is hugely cheaper than o1 so that cost 0.22 cents (almost a quarter of a cent) - but if o1 ran with the same number of tokens it would only cost 1.94 cents: https://www.llm-prices.com/#it=777&ot=130&sel=gpt-5%2Co1-pre...
When you ask the latest model, ChatGPT-5, to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result. Unlike ChatGPT-3, which at least attempted arithmetic internally (and often failed), ChatGPT-5 delegates computation to external tools. [1]
And added this note:
[1] There are 2 ways to multiply numbers in GPT-5:
- Python mode, which uses python sandbox as mentioned above
- No tool mode, which uses internal reasoning
Python mode is approximately 2x more accurate than no tool mode in FrontierMath (26.3% vs 13.5% accuracy on expert level math). Python mode is also 4x to 10x more cost effective than no tool mode. The GPT-5 API uses no-tool mode by default (tools must be explicitly enabled in API calls), while ChatGPT UI likely uses Python mode by default since Advanced Data Analysis is enabled by default for all Plus, Team, and Enterprise subscribers. This creates a significant cost optimization for OpenAI in the consumer product, while API users bear the full cost of inefficient reasoning unless they manually configure tool use.
---
Thanks again for flagging the inaccuracy, Simon! If you think any part of this update still misrepresents the model behavior, I’d love your input.
An LLM is very good at imitating moderate-length patterns. It can usually keep an apparently sensible conversation going with itself for at least a couple thousand words before it goes completely off the rails, although you never know exactly when that will happen; it's very unlikely to be after the first sentence, far more likely to be after the twenty-first, and it will never get past the 50th. If you inject novel input periodically (such as reminding and clarifying prompts), you can keep the plate spinning longer.
So some tricks work right now to extend the amount of time the thing can go before falling into the inevitable entropy that comes from talking to itself too long, and I don't think that we should assume that there won't ever be a way to keep the plate spinning forever. We may be able to do it practically (making it very unusual for them to fall apart), or somebody may come up with a way to make them provably resilient.
I don't know if the current market leaders have any insight into how to do this, however. But I'm also sure that an LLM reaching for a calculator and injecting the correct answer into the context keeps that context useful for longer than if it hadn't.
Not to say that GPT is conscious (in its current form I think it certainly isn't), but rather that reasoning is a positive development, not an embarrassing one.
I can't compute 297298*248 immediately in my head, and if I were to try it I'd have to hobble through a multiplication algorithm in my head... it's quite similar to what they're doing here, it's just that they can wire it right into a real calculator instead of slowly running a shitty algo on wetware.
Now, ideally, the LLMs could also design their own tools, when they realize there is a recurring task that can be accomplished better and more reliably by coding up a tool.
There seems to be a bit of "if your only tool is a hammer" going on with the desire to have a single model do everything.
It's literally the same thing. Sure, OpenAI's branding of ChatGPT as a product with GPT-5 is confusing, because GPT-5 is both a MODEL and a PRODUCT (collection of models, including GPT-5).
But does it matter?
QueensGambit•3mo ago
1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?
2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?
3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?
Would love to know your thoughts!
Workaccount2•3mo ago
Terr_•3mo ago
lawlessone•3mo ago
cpa•3mo ago
MoltenMan•3mo ago
simonw•3mo ago
Here's OpenAI's tweet about this: https://twitter.com/SebastienBubeck/status/19465776504050567...
> Just to spell it out as clearly as possible: a next-word prediction machine (because that's really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.
My notes: https://simonwillison.net/2025/Jul/19/openai-gold-medal-math...
They DID use tools for the International Collegiate Programming Contest (ICPC) programming one though: https://twitter.com/ahelkky/status/1971652614950736194
> For OpenAI, the models had access to a code execution sandbox, so they could compile and test out their solutions. That was it though; no internet access.
emp17344•3mo ago
simonw•3mo ago
Given how much bad press OpenAI got just last week[1] when one of their execs clumsily (and I would argue misleadingly) described a model achievement and then had to walk it back amid widespread headlines about their dishonesty, those researchers have a VERY strong incentive to tell the truth.
[1] https://techcrunch.com/2025/10/19/openais-embarrassing-math/
emp17344•3mo ago
simonw•3mo ago
It's also worth taking professional integrity into account. Even if OpenAI's culture didn't value the truth individual researchers still care about being honest.
emp17344•3mo ago
In OpenAI’s case, this isn’t exactly the first time they’ve been caught doing something ethically misguided:
https://techcrunch.com/2025/01/19/ai-benchmarking-organizati...
simonw•3mo ago
famouswaffles•3mo ago
simonw•3mo ago
If you call the OpenAI API for o1 and ask it to multiply two large numbers it cannot use Python to help it.
Try this:
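A minimal sketch of that kind of call (the model ID and the numbers are illustrative placeholders, not necessarily the exact ones from the gist):

    # Sketch: calling o1 via the OpenAI API with no tools configured at all.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content":
                   "Multiply 112775699087 by 4533442343 and show your working."}],
    )
    print(response.choices[0].message.content)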
Here's what I got back just now: https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e...
o1 correctly answered the multiplication by running a long multiplication process entirely through reasoning tokens.
alganet•3mo ago
> "tool_choice": "auto"
> "parallel_tool_calls": true
Can you remake the API call explicitly asking it to not perform any tool calls?
simonw•3mo ago
Those are its default settings whether or not there are tools configured. You can set tool_choice to the name of a specific tool in order to force it to use that tool.
I added my comment here to show an example of an API call with Python enabled: https://news.ycombinator.com/item?id=45686779
Update: Looks like you can add "tool_choice": "none" to prevent even tools you have configured from being called. https://platform.openai.com/docs/api-reference/responses/cre...
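Roughly like this, going by those docs (a sketch I haven't run; the code_interpreter tool definition shape is my assumption):

    # Sketch only: tool configured, but tool_choice "none" should stop it being called.
    from openai import OpenAI

    client = OpenAI()
    response = client.responses.create(
        model="gpt-5",
        input="Multiply 112775699087 by 4533442343.",
        tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
        tool_choice="none",
    )
    print(response.output_text)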
alganet•3mo ago
Can you remake the call explicitly using the value `none`?
Maybe it's not using Python, but it's using something else. I think it's a good test. If you're right, then the response shouldn't change.
Update: `auto` is ambiguous. It doesn't say whether it is picking from your selection of tools or from the pool of all available tools. Explicit is better than implicit. I think you should do the call with `none`; it can't hurt and it can prove me wrong.
simonw•3mo ago
I promise you it is not using anything else. It is performing long multiplication entirely through model reasoning.
(I suggest getting your own OpenAI API key so you can try these things yourself.)
simonw•3mo ago
OpenAI's gpt-oss-20b is a 12GB download for LM Studio from https://lmstudio.ai/models/openai/gpt-oss-20b
It turns out it's powerful enough to solve this. Here's the thinking trace:
And a screenshot: https://gist.github.com/simonw/a8929c0df5f204981652871555420...
photonthug•3mo ago
To summarize, with large numbers it goes nuts trying to find a trick or shortcut. After I cut off dead-ends in several trials, it always eventually considers long-form addition, then ultimately rejects it as "tedious" and starts looking for "patterns". Wait, let me use the standard multiplication algorithm step by step, oh that's a lot of steps, break it down into parts. Let me think. Over ~45 minutes of thinking (I'm on CPU), it basically cannot follow one strategy long enough to complete the work, even if it landed on a sensible approach.
For multiplying two-digit numbers, it does better. Starts using the "manual way", messes up certain steps, then gets the right answer for sub-problems anyway because obviously those are memoized somewhere. But at least once, it got the correct answer with the correct approach.
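For reference, the "manual way" it keeps reaching for is just schoolbook long multiplication, which is trivial for a short program but a long chain of fragile steps for a token-by-token model. A sketch:

    def long_multiply(a: str, b: str) -> str:
        """Schoolbook long multiplication on digit strings, one partial product at a time."""
        res = [0] * (len(a) + len(b))
        for i in range(len(a) - 1, -1, -1):          # digits of a, right to left
            for j in range(len(b) - 1, -1, -1):      # digits of b, right to left
                mul = int(a[i]) * int(b[j])
                p1, p2 = i + j, i + j + 1            # carry position, digit position
                total = mul + res[p2]
                res[p2] = total % 10
                res[p1] += total // 10               # carry propagates leftwards
        return "".join(map(str, res)).lstrip("0") or "0"

    print(long_multiply("297298", "248"))  # 73729904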
I think this raises the question, if you were to double the size of your input numbers and let the more powerful local model answer, could it still perform the process? Does that stop working for any reason at some point before the context window overflows?
rahimnathwani•3mo ago
rahimnathwani•3mo ago
8,657,216,880,231,672
I asked gpt-oss-20B the question three times. It took different routes each time. The first time it made some mistakes early on and then spent ages getting more confused. The other two attempts were successful.
alganet•3mo ago
simonw•3mo ago
rahimnathwani•3mo ago
alganet•3mo ago
I can see this call now has a lot more tokens for the reasoning steps. Maybe that's normal variance though.
(I don't have a particular interest in proving or disproving LLM things, so there's no incentive for me to get a key). There was an ambiguous point in the "proof", I just highlighted it.
simonw•3mo ago
You can also get an account with something like https://openrouter.ai/ which gives you one key to use with multiple different backends.
Or use GitHub Models which gives you free albeit limited access to a bunch at once. https://github.com/marketplace/models
alganet•3mo ago
Lots of people don't have resources to invest in LLMs (either self-hosted or not). They rely on what other people say. And people get caught in the hype all the time. As it turns out, lots of hype nowadays is around LLMs, so that's where I'll go.
I was skeptical about LK99. I didn't have the resources to independently verify it. It doesn't mean I don't believe in superconductors or that I should have no say in it.
Some of that hype will be justified, some will not. And that's exactly what I expect from this kind of technology.
simonw•3mo ago
alganet•3mo ago
I can invest lots of time in Linux, for example. I don't know how to write a driver for it, but I know I could learn how to do it. If there's a bug in a driver, there's nothing stopping me except my own will to learn. I can also do it on a potato, or on my phone.
I can experiment with free tier LLMs, but that's as far as I will go. It's not just about me, that is as far as 99% of the developers will go.
So, it's not uninteresting because it's boring or something. It's uninteresting because it puts a price on learning. That horizon of "if there's a bug in it, I can fix it" is severely limited. That's a price most free software developers don't consider worth paying. There's a lot of us.
simonw•3mo ago
I love learning about software. That's why I'm leaning so heavily on LLMs these days - they let me learn so much faster, and let me dig into whole new areas that previously I would never have considered experimenting with.
Just this week LLMs helped me figure out how to run Perl inside WebAssembly in a browser... and then how to compile 25-year-old C code to run in WebAssembly in the browser too. https://simonwillison.net/2025/Oct/22/sloccount-in-webassemb...
If I'd done this without LLMs I might have learned more of the underlying details... but realistically I wouldn't have done this at all, because my interest in Perl and C in WebAssembly is not strong enough to justify investing more than a few hours of effort.
alganet•3mo ago
A while back, I didn't even know those problems existed. It took me a while to understand them, why they're interesting, and why lots of people spend time on them.
I have tried to adapt the problems to the LLMs as well, such as shaping the problem to look more like a thing they're already trained on, but I soon realized the limitations of that approach.
I think in a couple of decades, maybe earlier, that kind of thing will be commonplace. People training their own stuff from scratch, on cheap hardware. It will unleash an even more rewarding learning experience for those willing to go the extra mile.
I think you're missing that perspective. That's fine, by the way. You're totally cool and probably helping lots of people with your work. I support it, it allows people to understand better where LLMs currently can help and where they cannot.
simonw•3mo ago
One of the reasons I am so excited about the "skills" concept from Anthropic is that it helps emphasize how the latest generation of LLMs really can pick up new capabilities if you get them to read a single, carefully constructed markdown file.
alganet•3mo ago
https://github.com/fosslinux/live-bootstrap/
Other efforts around the same problem are trying to make it more architecture independent or improve regenerations (re-building things like automake during the process).
It's free and open source, you're welcome to fork it and try your best with the aid of Claude. All you need is an x86 or x86-64 machine or qemu.
The project and other related repositories are already full of documentation in the markdown format and high quality commented code.
Here's a friendly primer on the problem:
https://www.youtube.com/watch?v=Fu3laL5VYdM
If you decide to help, please ask the maintainers if AI use is allowed beforehand. I'm OK with it, they might not be.
simonw•3mo ago
Note this bit where the code interpreter Python tool is called:
anonymoushn•3mo ago
remich•3mo ago
ACCount37•3mo ago
1. You can enable or disable tool use in most APIs. Generally, tools such as web search and Python interpreter give models an edge. The same is true for humans, so, no surprise. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.
2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.
3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in the meantime, I'd be able to buy a full GB200 NVL72 with that.
As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.
throwthrowrow•3mo ago
The original question still stands: do recent LLMs have an inherent knowledge of arithmetic, or do they have to offload the calculation to some other non-LLM system?
ACCount37•3mo ago
Which includes, among other things, the underappreciated metacognitive skill of "being able to decide when to do math quick and dirty, in one forward pass, and when to write it out explicitly and solve it step by step".
Today's frontier LLMs can do that. A lot of training for "reasoning" is just training for "execute on your knowledge reliably". They usually can solve math problems with no tool calls. But they will tool call for more complex math when given an option to.
Terr_•3mo ago
[0] https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator...
vrighter•3mo ago
ACCount37•3mo ago
Which was also when the capabilities of LLMs became completely impossible to either ignore or excuse as "just matching seen data". But that was, in practice, solvable simply by increasing the copium intake.
XenophileJKO•3mo ago
This improved think->act->sense loop that they now form exponentially increases the possible utility of the models. We are just starting to see this with gpt-5 and the 4+ series of Claude models.
emp17344•3mo ago
XenophileJKO•3mo ago
emp17344•3mo ago
remich•3mo ago
Caveat that we don't fully understand how human intelligence works, but with humans it's generally true that skills are not static or siloed. Improving in one area can generate dividends in others. It's like how some professional football players improve their games by taking ballet lessons. Two very different skills, but the incorporation of one improves the other as well as the whole.
I would argue that narrowly focusing on LLM performance via benchmarks before tool use is incorporated is interesting, but not particularly relevant to whether they are transformative, or even useful, as products.
mirekrusin•3mo ago
Reasoning just means more implicit chain-of-thought. It can be emulated with a non-reasoning model by explicitly constructing the prompt to perform a longer step-by-step thought process. With reasoning models it just happens implicitly, and some models allow control over reasoning effort with special tokens. Those models are simply fine-tuned to do it themselves without explicit dialogue from the user.
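For example, a prompt along these lines (wording is purely illustrative) gets a plain instruction-following model to produce the kind of step-by-step text a reasoning model emits on its own:

    # Sketch: coaxing chain-of-thought out of a non-reasoning model via the prompt alone.
    messages = [
        {"role": "system", "content": "Think out loud. Write numbered intermediate "
                                      "steps and check each one before answering."},
        {"role": "user", "content": "What is 297298 * 248? Show your working, "
                                    "then give the final answer on its own line."},
    ]
    # A reasoning model is fine-tuned to produce this kind of trace unprompted,
    # often as hidden "reasoning tokens" rather than visible output.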
Tool calling happens primarily on the client side. Research/web-access modes etc. made available by some providers (based on tool calling that they handle themselves) are not a property of the model and can be enabled for any model.
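A sketch of that client-side loop (the message and tool-call shapes vary by provider; nothing here is a specific SDK):

    # Generic client-side tool loop; dict shapes are illustrative, not a real API.
    def run_with_tools(call_model, tools, messages):
        while True:
            reply = call_model(messages)            # model answers or requests a tool
            messages.append(reply)
            if not reply.get("tool_calls"):         # no tool request: this is the answer
                return reply["content"]
            for call in reply["tool_calls"]:        # run each requested tool locally
                result = tools[call["name"]](**call["arguments"])
                messages.append({"role": "tool", "name": call["name"],
                                 "content": str(result)})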
Nothing has plateaued from where I'm standing – new models are being trained, releases happen frequently with impressive integration speed. New models outperform previous ones. Models gain multimodality, etc.
Regarding alternative architectures – there are new ones proposed all the time. It's not easy to verify all of them at scale. Some ideas that extend the current state-of-the-art architectures end up in frontier models, but it takes time to train, so a lag does exist. There are also a lot of improvements that are hidden from the public by commercial companies.
Legend2440•3mo ago
Both reasoning and non-reasoning models may choose to use the Python interpreter to solve math problems. This isn't hidden from the user; it will show the interpreter ("Analyzing...") and you can click on it to see the code it ran.
It can also solve math problems by working through them step-by-step. In this case it will do long multiplication using the pencil-and-paper method, and it will show its work.
mxkopy•3mo ago
More broadly I think what we're looking for at the end of the day, AGI, is going to come about from a diaspora of methods capturing the diverse aspects of what we recognize as intelligence. 'Precise deductive reasoning' is one capability out of many. Attention isn't all you need, and neither is compression, convex programming, or what have you. The perceived "smoothness" or "unity" of our intelligence is an illusion, like virtual memory hiding the cache, and building it is going to look a lot more like stitching these capabilities together than deriving some deep and elegant equation.
kgeist•3mo ago