Also I wonder if the LLM "knows" that it has this capability after fine-tuning. If it encounters multiplication as part of some larger chain-of-thought, will it solve that internally, or will it continue to do it step-by-step in the chain-of-thought?
You can sit here and force me to recite ("train me on") multi-digit multiplication problems and their result until the day I die, and my language model is only going to get marginally better. It is in practicing my symbolic manipulation that I'm going to get better and faster.
It seems to me that expecting a Language Model to be very good at multiplication is asking for a substantially superhuman level of performance from them, and one that we have little reason to believe will scale anyhow. What we need is symbolic manipulation, better than the approximation they achieve when "reasoning".
I find it rather ironic to take the aforementioned 15 orders of magnitude of improvement over the Commodore PET and use that level of symbolic-manipulation firepower to laboriously recreate a software system that is as bad as we are at multiplication, for what may well be the same fundamental reasons... and then have the audacity to complain about it. My metaphorical dude, you did a couple trillion multiplications just to get to this single bad multiplication output... maybe another approach is called for.
To disprove my point, please generate a list of 5 random 5-digit numbers and demonstrate multiplying them in your head as quickly as you can read them. Since you can't, clearly there is something about it that is hard for you, despite the fact that the act of reading this text, maintaining physical homeostasis while you do it, and all the other things your brain is doing right now represent a staggering amount of raw computation, vastly, vastly in excess of what is nominally needed for that multiplication.
Mathematics was born out of very careful reasoning that we do through language; we only use formalisms because they allow us to avoid the massive ambiguities that exist in natural language. Formal symbolic manipulation grew out of our already existing ability to manipulate symbols through language.
I think most humans that do math aren't actually literally computing things as some kind of logic machine.
We can produce logic, and follow the steps of using that logic, but it doesn't seem to me that our cognition is some kind of logic machine itself.
I guess the challenge is, where would the training data come from? Data on the internet is in its final form so "next token" is never a delete.
Edit: I guess in essence, that's what reasoning LLMs already do. IIUC the thought blocks are ephemeral, and only the response is maintained for the chat. Maybe there'd be some benefit of doing this recursively? But that's also kind of what subagents are for. So, perhaps nothing new here.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
The paper looks at multiplication of numbers represented with the least significant digit first as a toy task requiring several additions as intermediate steps to study why a model large enough to perform those additions in principle fails to learn to do so in practice.
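Roughly, the intermediate additions look like this (a minimal Python sketch of the decomposition, not the paper's exact CoT format):

    # Minimal sketch of least-significant-digit-first multiplication as a
    # sequence of intermediate additions (not the paper's exact CoT format).
    def multiply_lsd_first(a_digits, b_digits):
        """Digits are given least significant first, e.g. 1234 -> [4, 3, 2, 1]."""
        a_value = sum(d * 10**i for i, d in enumerate(a_digits))
        acc = 0
        partial_sums = []                    # the steps a CoT could spell out
        for shift, b in enumerate(b_digits):
            acc += b * a_value * 10**shift   # add one shifted partial product
            partial_sums.append(acc)
        return [int(c) for c in reversed(str(acc))], partial_sums

    digits, steps = multiply_lsd_first([4, 3, 2, 1], [8, 7, 6, 5])  # 1234 * 5678
    assert digits == [int(c) for c in reversed(str(1234 * 5678))]
    print(steps)  # [9872, 96252, 836652, 7006652]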
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" with a specific format) and then has this CoT progressively shortened during training until there's nothing left of it. But that second model successfully multiplies.
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
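For what it's worth, the log route is easy to sketch and immediately shows the precision problem (a toy Python illustration, nothing to do with how the paper trains):

    import math

    # Multiply via log / add / exp. Float64 only carries ~15-16 significant
    # digits, so this can't recover exact products of large integers.
    a, b = 1234, 5678
    approx = math.exp(math.log(a) + math.log(b))
    print(approx, round(approx) == a * b)  # exact here only because the numbers are small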
I suppose you're probably right, but LLMs probably have a lot of log tables in their training data so I'm not so sure.
Also, it’s interesting that one of the big goals/measures of models is their capacity to “generalize”, yet the training methods optimize for loss/accuracy, and generalization is only tested after training, as validation.
Are there training methods/curriculums that explicitly maximize generalization?
i.e. enduring countless generations of evolutionary selection and cross breeding, then fine-tuning a bit?
Although it could be interesting, I don't think training on progressively more complex strings entirely recapitulates this.
I guess if you really wanted to start from scratch, you could figure out how to evolve the whole system from a single cell or something like that. In some ways neural networks have kind of evolved in that way, assisted by humans. They started with a single perceptron, and have gone all the way to deep learning and convolutional networks
I also remember a long time ago studying genetic and evolutionary algorithms, but they were pretty basic in terms of what they could learn and do, compared to modern LLMs
Although recently I saw some research in which they were applying essentially genetic algorithms to merge model weights and produce models with new/evolved capabilities
You don't need general intelligence to make a decent coding tool like Cursor.
You don't need general intelligence to improve SERPs.
You don't need general intelligence to sell a subscription for a decent AI assistant.
There's tons of value already added without anything general.
I mean, probably, LLMs as they are today are already changing the world. But I do think a lot of the ongoing investment is propped up on the promise of another breakthrough that is looking less likely.
Would be a far more accurate statement. Training != Learning.
Or will we need to produce a host of documents and (re)train a new one in order for the concept to be deeply integrated?
This distinction is subtle but lost on many who think that our current path will get us to AGI...
That isn't to say we haven't created a meaningful tool, but the sooner we get candid and realistic about what it is and how it works, the sooner we can get down to the business of building practical applications with it. (And, as an aside, scaling it, something we aren't doing well with now.)
Would a single human/entity learn more in, say, three million years, or would short-lived ones evolving over three million years and then getting ~20 years of education learn more?
The current AI tech cycle is focusing on the first, but we don't really know if there are benefits of both.
There's no obvious way to combine these yet.
(I’m specifically talking about commercial hosted ones that have the capability I describe - obviously your run-of-the-mill one downloaded off the internet cannot do this).
When I multiply, I take it in chunks.
Put the LLM into a loop, instruct it to keep track of where it is and have it solve a digit at a time.
I bet it does just fine. See my other comment as to why I think that is.
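For concreteness, the kind of loop I mean looks something like this (a hypothetical sketch: call_llm is a stand-in for whatever model API you use, and the prompt/response format is made up):

    # Hypothetical driver loop: one result digit per call, least significant
    # first, with the model carrying its own scratch notes between calls.
    # call_llm is a placeholder for whatever chat-completion API you use.
    def multiply_digit_by_digit(call_llm, a: int, b: int) -> str:
        notes = "no work done yet"
        answer_digits = []
        for position in range(len(str(a)) + len(str(b))):  # enough digit slots
            prompt = (
                f"We are computing {a} * {b} one result digit at a time, "
                f"least significant digit first.\n"
                f"Scratch notes from previous steps: {notes}\n"
                f"Give the digit for the 10^{position} place and updated notes, "
                f"formatted exactly as 'DIGIT: d' then 'NOTES: ...'."
            )
            digit_line, notes_line = call_llm(prompt).splitlines()[:2]
            answer_digits.append(digit_line.removeprefix("DIGIT:").strip())
            notes = notes_line.removeprefix("NOTES:").strip()
        return "".join(reversed(answer_digits)).lstrip("0") or "0"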
I tried to raise this question yesterday. https://news.ycombinator.com/item?id=45683113#45687769
Declaring victory on "reasoning" based on cherry-picking a correct result about arithmetic is, of course, very narrow and absurdly optimistic. Even if it correctly works for all NxM calculations. Moving on from arithmetic to any kind of problem that fundamentally reduces to model-checking behind the scenes.. we would be talking about exploring a state-space with potentially many thousands of state-transitions for simple stuff. If each one even has a small chance of crapping out due to hallucination, the chance of encountering errors at the macro-scale is going to be practically guaranteed.
Everyone will say, "but you want tool-use or code-gen for this anyway". Sure! But carry-digits or similar is just one version of "correct matters" and putting some non-local kinds of demands on attention, plus it's easier to check than code. So tool-use or code-gen is just pushing the same problem somewhere else to hide it.. there's still a lot of steps involved, and each one really has to be correct if the macro-layer is going to be correct and the whole thing is going to be hands-off / actually automated. Maybe that's why local-models can still barely handle nontrivial tool-calling.
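To put a rough number on the compounding-error point: if each step independently has even a small chance p of going wrong, the odds of a clean end-to-end run collapse fast.

    # Back-of-envelope: probability every one of n steps is correct is (1 - p)^n.
    for p in (0.001, 0.01):
        for n in (100, 1_000, 10_000):
            print(f"p={p}, steps={n}: P(all correct) = {(1 - p) ** n:.4f}")
    # Even p=0.01 over 1,000 steps leaves essentially no chance of a clean run.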
Here we can see the amount of data a high-end traditional non-SoC CPU holds:
> For a recent high-end non-SoC desktop CPU:
> Cache: ~40-100 MB total (L1 + L2 + shared L3)
> Register files: tens to few hundreds of KB total across cores (e.g., ~200-300 KB or so)
> Combined: So you're looking at ~40-100 MB + ~0.2 MB → roughly ~40-100 MB of total on-chip caches + registers.
I'm sure we can reduce these caches to fit in the context windows of today's LLMs (~500,000 tokens).
Then, with temperature 0, we get more "discrete" operations. We would still have the occasional hallucination, but that risk should be small at temperature 0.
And temperature 0 makes outputs deterministic, not magically correct.
For reasons I don't claim to really understand, I don't think it even makes them deterministic. Floating point something something? I'm not sure temperature even has a static technical definition or implementation everywhere at this point. I've been ignoring temperature and using nucleus sampling anywhere that's exposed and it seems to work better.
Random but typical example.. pydantic-ai has a caveat that doesn't reference any particular model: "Note that even with temperature of 0.0, the results will not be fully deterministic". And of course this is just the very bottom layer of model-config and in a system of diverse agents using different frameworks and models, it's even worse.
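To be concrete about what the knobs even mean, here's one common formulation of temperature and nucleus (top-p) sampling for a single next-token distribution; real serving stacks differ in the details (batching, floating-point nondeterminism, etc.), which is part of why "temperature 0" still isn't an end-to-end determinism guarantee:

    import numpy as np

    # One common formulation; implementations vary across libraries.
    def sample(logits, temperature, top_p, rng):
        if temperature == 0.0:
            return int(np.argmax(logits))        # greedy: most likely token, nothing sampled
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]          # most probable first
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
        nucleus = order[:cutoff]                 # smallest set covering top_p of the mass
        return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.0, 0.5, -1.0])
    print(sample(logits, 0.0, 1.0, rng))   # always 0 (greedy)
    print(sample(logits, 0.8, 0.9, rng))   # drawn from the top-p nucleus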
Your previous example shows the best case, which is that a model can sometimes follow a textual recipe for long multiplication on short inputs. That's not the same as learning a length-generalizing, bit-exact algorithm.
Basically, what you've shown is that the model can describe the algorithm. It doesn't show it can execute it at scale. Without writable state and bit-exact ops, errors grow with length, and "focus more" only slows that failure; it doesn't eliminate it.
In their prompt, they told it to leave itself a note and to accomplish something each time.
Then they put the model in a loop and it worked. In one instance, a model removed itself from the loop by editing a file or some other basic means.
To me, iterative tasks like multiplication and long division look an awful lot like the code-port experiment.
Putting models into loops so they get more than one bite at the task seems to be a logical progression to improve capability.
The approach of “try a few more things before stopping” is a great strategy akin to taking a few more stabs at RNG. It’s not the same as saying keep trying until you get there - you won’t.
That's one hell of a criterion. Test-time inference undergoes a similar scaling law to pretraining, and has resulted in dramatically improved performance on many complex tasks. Law of diminishing returns kicks in of course, but this doesn't mean it's ineffective.
> akin to taking a few more stabs at RNG
Assuming I understand you correctly, I disagree. Scaling laws cannot appear with glassy optimisation procedures (essentially iid trials until you succeed, the mental model you seem to be implying here). They only appear if the underlying optimisation is globally connected and roughly convex. It's no different than gradient descent in this regard.
In this paper, the task is to learn how to multiply, strictly from AxB=C examples, with 4-digit numbers. Their vanilla transformer can't learn it, but the one with (their variant of) chain-of-thought can. These are transformers that have never encountered written text, and are too small to understand any of it anyway.
There is an inherent numeric-ness and logic to math that I don't think we can represent well using LLMs and transformers.
3 isn't about the word "three" - it is a quantity or a measurement. And 3x4 is a specific numerical operation that is not really contained in that sequence of symbols.
Maybe you could say that algebra is just symbol manipulation.
And in any case - "set of rules" is exactly what transformers aren't good at. Transformers are good at capturing the essence of what you meant and responding in a sensible, but not rule-bound way. This works well for language problems.
Perhaps you could argue that transformers are just a set of rules (weights/parameters) being applied, and you might similarly argue that numbers reduce to logical symbols like S(0), S(S(0)), but then I'd argue that you're missing the point.
Most languages and their stdlibs cannot deal with numbers properly at all. Most overflow without errors. Most integers cannot keep precision, and most cannot promote types properly.
I only know of Common Lisp, Scheme, Python 3, Ruby, Erlang, Haskell, and Raku, which can handle numbers properly by default. Python is extremely slow at it, though.
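For instance, here's the default behavior I mean, sketched in Python 3, with the float path standing in for what fixed-precision arithmetic silently does to you:

    a = 12345678901234567890
    b = 98765432109876543210

    exact = a * b                      # Python 3 ints are arbitrary precision: never overflows
    approx = int(float(a) * float(b))  # 64-bit float route: silently loses the low digits
    print(exact)
    print(approx)
    print(exact == approx)             # False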