Also, solution testing is mandatory. Luckily, you can ask an RNG for that, too, as long as you have tests for the testers already written.
Maybe the hope is that you won't have to manually map the universal algorithm to your specific problem and can just train the transformer to figure it out instead, but there are few proofs that transformers can solve all problems in some complexity class through training instead of manual construction.
The real bitter lesson in AI is that we don't really know what we're doing. We're hacking on models looking for architectures that train well but we don't fully understand why they work. Because we don't fully understand it, we can't design anything optimal or know how good a solution can possibly get.
Well, technically, that's not true: The entire idea behind complexity theory is that there are some tasks that you can't throw more hardware at - at least not for interesting problem sizes or remotely feasible amounts of hardware.
I wonder if we'll reach a similar situation in AI where "throw more context/layers/training data at the problem" won't help anymore and people will be forced to care more about understanding again.
More precisely, I think producing a good, fast merge of ca. 5 lists was a problem I didn't have good answers for, but maybe I was too fixated on a streaming solution and didn't apply enough tricks.
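(For reference, the textbook streaming answer here is a k-way merge over a min-heap; a minimal sketch, assuming the inputs are already sorted, equivalent in spirit to Python's heapq.merge:)

```python
import heapq
from typing import Iterable, Iterator

def merge_sorted(*iterables: Iterable[int]) -> Iterator[int]:
    # Stream-merge already-sorted iterables using a min-heap.
    # Roughly O(total_items * log k) for k input lists.
    heap = []
    for idx, it in enumerate(iterables):
        it = iter(it)
        first = next(it, None)
        if first is not None:
            heap.append((first, idx, it))  # idx breaks ties so iterators never compare
    heapq.heapify(heap)
    while heap:
        value, idx, it = heapq.heappop(heap)
        yield value
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx, it))

# e.g. list(merge_sorted([1, 4, 9], [2, 3, 8], [5, 6, 7]))
```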
When all you have is a hammer... It makes a lot of sense that a transformation layer that makes the tokens more semantically relevant will help optimize the entire network after it and increase the effective size of your context window. And one of the main immediate obstacles stopping those models from being intelligent is context window size.
On the other hand, the current models already cost something on the order of the median country's GDP to train, and they are nowhere close to that in value. The saying that "if brute force didn't solve your problem, you didn't apply enough force" is meant to be taken as a joke.
https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)
Models are expensive, but they're not that expensive.
[0] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nomi...
The largest economy (US) has a GDP of $27.7 trillion.
The smallest economy (Tuvalu) has a GDP of $62.3 million.
The $48 billion number is the median: half of all countries have larger GDPs and half have smaller ones.
Is this really true?
Can someone (who knows about LLMs) explain why the r's in strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each was one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less-informed people, which then got picked up?
Count the number of Rs in this sequence: [496, 675, 15717]
Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
Human: Which is the easier of these formulas
1. x = SQRT(4)
2. x = SQRT(123567889.987654321)
Computer: They're both the same.
[496, 675, 15717] is the GPT-4 token representation of the word. In order to determine which letters a token represents, the model needs to learn the relationship between "str" and [496]. It can learn that mapping (since it can spell the word out as "S-T-R" or "1. S, 2. T, 3. R" or whatever), but it adds an extra step.
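A quick way to see what the model actually receives (a sketch, assuming the tiktoken package; exact IDs and splits depend on the encoding):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
ids = enc.encode("strawberry")
print(ids)  # something like [496, 675, 15717]
print([enc.decode_single_token_bytes(i) for i in ids])  # e.g. [b'str', b'aw', b'berry']
# The model only ever sees the integer IDs; "which letters are inside token 496?"
# is a mapping it has to learn, not something it can read off the input.
```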
The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?
It seems like the longer context length makes the trade off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when it doesn't have access to individual digits (early Llama math results, for example). Once they changed the digit tokenization, the math performance improved.
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
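As a toy illustration of why the character-level case is structurally easier (hypothetical vocabularies; the subword split assumes the str/aw/berry tokens quoted above):

```python
# Hypothetical character-level vocabulary: one ID per letter.
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_ids = [char_vocab[c] for c in "strawberry"]
print(char_ids.count(char_vocab["r"]))  # 3 -- counting a letter is just counting one ID

# Subword IDs (per the comment above): the answer is spread across tokens.
bpe_ids = [496, 675, 15717]
# The model must have memorized that 496 -> "str" (one r), 675 -> "aw" (none),
# 15717 -> "berry" (two), and then add them up -- a per-token lookup, not a
# position-independent counting rule.
```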
I'm not sure about what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.
Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize vs memorize.
GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )
More recent research:
https://huggingface.co/spaces/huggingface/number-tokenizatio...
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903
https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
https://twitter.com/yuntiandeng/status/1836114401213989366
If anything I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring use of COT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained in the right way can do arithmetic, where are the papers solving the "count R in strawberry" problem?
IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves that better tokenization meant the problem would be solved. Meanwhile, RLHF'ers were/are being paid en masse to solve the problem by correcting thousands of these "counting"/perfect-syntax prompts and problems. To me, since RLHF work was being paid to tackle these problems, it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fix would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and heavy syntax-based problems).
No, the reason models are better at these problems now is RLHF. And before you say "well, now models have learned how to count in general", I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own. It's not how humans process information, but it can still be useful.
As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.
Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:
Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.
When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
> Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
https://www.anthropic.com/news/tracing-thoughts-language-mod...
It seems to be about as useful as asking a person how their hippocampus works: they might be able to make something up, or repeat a vaguely remembered bit of neuroscience, but they don't actually have access to their own hippocampus' internal workings, so if they're correct it's by accident.
I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
To the extent we've already found that to be the case, it's perhaps the weirdest part of this whole "paradigm shift."
Then a system samples from that distribution, typically with randomness, and some optimizations used when serving them introduce randomness too, but it's important to understand that the models themselves are not random.
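A sketch of where the randomness enters (hypothetical logits; greedy decoding with temperature 0 removes it entirely):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 1.0, seed: int | None = None) -> int:
    # The forward pass that produced `logits` is deterministic: same input, same logits.
    # Randomness only enters in this sampling step.
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy decoding: fully deterministic
    rng = np.random.default_rng(seed)
    z = (logits - logits.max()) / temperature
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# e.g. sample_token(np.array([2.0, 1.0, 0.1]), temperature=0.8)
```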
It's best to assume that the relationship between input and output of an LLM is not deterministic, similar to something like using a Google search API.
https://arxiv.org/abs/2402.14903
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
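A sketch of what that right-to-left split looks like (illustrative regex, not the exact pre-tokenizer from either paper):

```python
import re

def r2l_digit_groups(digits: str) -> list[str]:
    # Group digits in threes from the right, so "1234567" -> ["1", "234", "567"]
    # instead of the left-to-right default ["123", "456", "7"].
    return re.findall(r"\d{1,3}(?=(?:\d{3})*$)", digits)

print(r2l_digit_groups("1234567"))  # ['1', '234', '567']
```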
Specifically, they made tokens for runs of 4, 8, 12, or 16 spaces, or something like that.
Maybe if you have infinite compute you don't worry about software design. Meanwhile in the real world...
Not only that, but where did all these compute-optimized solutions come from? Oh yeah, millions of man-hours of optimizing and testing algorithmic solutions. So unless you are some head-in-the-clouds tenured professor, just keep doing your optimizations and your job as usual.
I’m hoping someday that dude releases an essay called The Cold Comfort. But it’s impossible to predict when or who it will help, so don’t wait for it.
Of course, instead of the beach one could spend those Y months improving the algorithms... but it's never wise to bid against yourself if you don't have to.
A corollary is that to maximize your beach time you should work on the biggest N possible, neatly explaining the popularity of AI startups.
This is all explained in the original essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
That said, the hand-coded nature of tokenization certainly seems in dire need of a better solution, something that can be learned end to end. And it looks like we are getting closer with every iteration.
> As it's been pointed out countless times - if the trend of ML research could be summarised, it'd be the adherence to The Bitter Lesson - opt for general-purpose methods that leverage large amounts of compute and data over crafted methods by domain experts
But we're only 1 sentence in, and this is already a failure of science communication at several levels.
1. The sentence structure and grammar are simply horrible
2. This is condescending: "pointed out countless times" - has it?
3. The reference to Sutton's essay is oblique, easy to miss
4. Outside of AI circles, "Bitter Lesson" is not very well known. If you didn't already know about it, this doesn't help.
So any system that predicts the optimization with a general solver can scale better than heuristic or constrained-space solvers.
Until recently there have been no general solvers at that scale.
Scene_Cast2•6h ago
Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies that we have a maximum of 1k degrees of freedom (or rank) on our output. The model is able to pick any one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
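A scaled-down numerical illustration of that rank limit (toy sizes standing in for the hypothetical 15k vocab / 1k embedding above, not a real model):

```python
import numpy as np

vocab_size, d_model, n_positions = 1500, 100, 400
rng = np.random.default_rng(0)

W_out = rng.standard_normal((d_model, vocab_size))    # final (un)embedding projection
hidden = rng.standard_normal((n_positions, d_model))  # hidden states at 400 positions

logits = hidden @ W_out                  # shape (400, 1500)
print(np.linalg.matrix_rank(logits))     # 100: every logit vector lies in a
# d_model-dimensional subspace of the vocab_size-dimensional output space,
# no matter how many positions you stack.
```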
Scene_Cast2•4h ago
However, I'm talking about the probability distribution of tokens.
molf•6h ago
12288 dimensions (GPT-3 size) can fit more than 40 billion nearly perpendicular vectors. [1]
[1]: https://www.3blue1brown.com/lessons/mlp#superposition
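You can get a feel for the "nearly perpendicular" part with random directions (a sketch; random sampling illustrates the high-dimensional geometry, not the 40-billion packing bound itself):

```python
import numpy as np

d, n = 12288, 1000  # GPT-3 width, and a sample of 1000 random directions
rng = np.random.default_rng(0)
V = rng.standard_normal((n, d)).astype(np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
print(off_diag.mean(), off_diag.max())
# Typically ~0.007 mean and a few hundredths max: random high-dimensional
# vectors are all within a couple of degrees of perpendicular to each other.
```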
imurray•4h ago
Detecting and preventing unargmaxable outputs in bottlenecked neural networks, Andreas Grivas (2024)
incognito124•4h ago
If I remember correctly, that's not true because of the nonlinearities, which provide the model with more expressivity. The transformation from 15k to 1k is rarely an affine map; it's usually highly non-linear.