frontpage.

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
233•theblazehen•2d ago•68 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
695•klaussilveira•15h ago•206 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
7•AlexeyBrin•1h ago•0 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
962•xnx•20h ago•555 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
130•matheusalmeida•2d ago•35 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
67•videotopia•4d ago•6 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
54•jesperordrup•5h ago•25 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
11•matt_d•3d ago•2 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
37•kaonwarb•3d ago•27 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
236•isitcontent•15h ago•26 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
234•dmpetrov•16h ago•125 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
33•speckx•3d ago•21 comments

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
12•__natty__•3h ago•0 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
335•vecti•17h ago•147 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
502•todsacerdoti•23h ago•244 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
386•ostacke•21h ago•97 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
300•eljojo•18h ago•186 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•185 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
425•lstoll•21h ago•282 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
68•kmm•5d ago•10 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
96•quibono•4d ago•22 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
21•bikenaga•3d ago•11 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
19•1vuio0pswjnm7•1h ago•5 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
265•i5heu•18h ago•217 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
33•romes•4d ago•3 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
64•gfortaine•13h ago•28 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1077•cdrnsf•1d ago•460 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
39•gmays•10h ago•13 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
298•surprisetalk•3d ago•44 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
154•vmatsiiako•20h ago•72 comments

Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)

https://ndingwall.github.io/blog/tokenization
80•phewlink•8mo ago

Comments

anonymoushn•8mo ago
How does SentencePiece choose the initial vocabulary, which is trimmed down to determine the final vocabulary which has these desirable properties?
mcyc•8mo ago
Just a minor nit: SentencePiece is a library, not a tokenization algorithm. It implements two tokenization algorithms, Unigram and BPE.

BPE builds vocabularies from the bottom up, so I assume you are talking about Unigram, which starts with a big vocabulary and trims it down.

The details of UnigramLM are here https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is Section 3.2.

Basically, it just selects all substrings that appear in the corpus up to a certain length (and then maybe trims it a little by discarding rare substrings or something to reduce the initial size a bit and make things faster).
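
For illustration, a brute-force sketch of that seeding step (not SentencePiece's actual implementation, which enumerates frequent substrings with an enhanced suffix array rather than counting everything):

    from collections import Counter

    def seed_vocab(corpus, max_len=16, seed_size=1_000_000):
        # Enumerate every substring up to max_len characters and count it.
        counts = Counter()
        for text in corpus:
            for i in range(len(text)):
                for j in range(i + 1, min(i + max_len, len(text)) + 1):
                    counts[text[i:j]] += 1
        # Keep the highest-scoring candidates; scoring by count * length here,
        # per the discussion below about how the seed n-grams are weighted.
        scored = sorted(counts, key=lambda s: counts[s] * len(s), reverse=True)
        return scored[:seed_size]
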

anonymoushn•8mo ago
If the library has two vocabulary learners, only one of which does the described thing, then isn't it unambiguous which implementation within the library the question refers to? And wouldn't it be ambiguous to instead say "how does Unigram do it" without referring to any particular implementation?

Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array algorithm isn't a top-k algorithm.

cschmidt•8mo ago
It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods.

See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...

I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.

mcyc•8mo ago
You can cross whitespace boundaries by setting flag `--split-on-whitespace` to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...

cschmidt•8mo ago
For anyone reading this in the future: I meant to say the length weighting is a bit nonstandard. It is usually by frequency alone. Oops
wejick•8mo ago
The vibe of AI discussions back then was so different from today's (or at least from the past 2.5 years). It's quite surreal how fast things have moved.
blurbleblurble•8mo ago
It's always seemed like such low hanging fruit, so much semantic information just squandered in the thirst for larger models.
kevmo314•8mo ago
Unfortunately the choice of tokenizer is baked into the model. If you want to innovate on the tokenizer, you have to train a whole base model yourself to prove it's better which makes the barrier for innovation pretty high.
mdaniel•8mo ago
Apologies if this is a dumb question, but is there no "hello world"-ish sandbox for testing this theory? I can very easily imagine that trying to go head-to-head with R1 or such is going to take a boatload of GPU, but for just testing tokenizers head-to-head, isn't there a smaller-sized model that can be used in a bake-off?
anonymoushn•8mo ago
You can use modded-nanogpt for testing this sort of change. If you don't want to do anything really weird, you can just train whatever tokenizer you like on the training set, then retokenize the input, then run the existing training code. One person did this earlier this year using a Tokenmonster vocab and got better downstream performance in less train time.
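
For illustration, a minimal sketch of that workflow using SentencePiece to train a BPE vocab and re-encode the splits into the flat uint16 token files that nanoGPT-style training scripts read (file names and vocab size are placeholders):

    import numpy as np
    import sentencepiece as spm

    # Train your own tokenizer on the training text only.
    spm.SentencePieceTrainer.train(
        input="train.txt", model_prefix="custom", vocab_size=32_000, model_type="bpe"
    )
    sp = spm.SentencePieceProcessor(model_file="custom.model")

    # Retokenize both splits and dump them in the format the training code expects.
    for split in ("train", "val"):
        with open(f"{split}.txt", encoding="utf-8") as f:
            ids = sp.encode(f.read())
        np.array(ids, dtype=np.uint16).tofile(f"{split}.bin")
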
pama•8mo ago
Which exact attempt, with which Tokenmonster vocab, are you referring to? Sometimes it is hard to conclude much from these efforts. For example, having a smaller vocabulary is typically only useful for small models, where the compute cost of the softmax layer at the end of the decoder may still factor into the performance equation. Fixing the size of the vocabulary while increasing the rest of the model makes this inefficiency disappear.
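
As a rough illustration of that point (all numbers hypothetical; using the common 12 * n_layers * d_model^2 approximation for the transformer's weight count):

    def softmax_share(d_model, n_layers, vocab_size):
        # Approximate share of parameters sitting in the output vocabulary projection.
        transformer = 12 * n_layers * d_model ** 2   # attention + MLP weights
        output_matrix = d_model * vocab_size         # final projection onto the vocab
        return output_matrix / (transformer + output_matrix)

    print(softmax_share(d_model=768,  n_layers=12, vocab_size=50_000))  # ~0.31 for a GPT-2-small-sized config
    print(softmax_share(d_model=8192, n_layers=80, vocab_size=50_000))  # ~0.006 for a much larger config
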
anonymoushn•8mo ago
https://x.com/alexjc/status/1881410039639863622

I agree that it would be more useful to compare vocabs of identical size.

pama•8mo ago
Thanks! It seems to me that the performance gain here was due to a smaller vocab size. This type of change is almost guaranteed to backfire for larger models and larger datasets / lower loss requirements, so it probably is not very useful. Generally the historical trend has been to use larger vocabularies as the models got larger themselves.
anonymoushn•8mo ago
Well, he says he achieved some downstream win on the same size, but it didn't translate into a win in perplexity, so he tried to do something else. Like I said, it's unfortunate.
anonymoushn•8mo ago
I actually wonder if he could just claim a win by calculating validation set BPB for both equally-sized vocabs instead of targeting the same level of perplexity as in the speedrun finish line lol
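
For reference, bits-per-byte is easy to compute from the validation loss and, unlike per-token perplexity, is comparable across differently sized vocabs; a minimal sketch (names are illustrative):

    import math

    def bits_per_byte(total_nll_nats, total_utf8_bytes):
        # total_nll_nats: summed cross-entropy (natural log) over the validation text
        # total_utf8_bytes: size of that same validation text in UTF-8 bytes
        return total_nll_nats / (math.log(2) * total_utf8_bytes)
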
z3c0•8mo ago
To expand on the other comment, if you look under the data folder in nanoGPT, you can see examples of how to train the model using various data sources and encoders. "shakespeare_char" is probably the most rudimentary, only converting the characters of the input into integers.

e.g. https://github.com/karpathy/nanoGPT/blob/master/data/shakesp...
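
A sketch of what that character-level encoding amounts to (in the spirit of that prepare script, not a verbatim copy):

    # Every distinct character in the corpus gets an integer id.
    with open("input.txt", encoding="utf-8") as f:
        data = f.read()

    chars = sorted(set(data))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)

    ids = encode(data)  # list of ints, ready to be saved as a training array
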

yuvalpinter•8mo ago
We've been working on this problem for quite some time in my lab. We released a benchmark piling together several "intrinsic evaluations" that don't require model training. We're currently investigating correlations between performance on this benchmark and on downstream tasks. Here it is - https://github.com/MeLeLBGU/tokenizers_intrinsic_benchmark - where there's a link to the paper we introduced it in, used for checking how inference schemes work together with various token vocabs. It's for English, but cited work and citing work have some of these for other languages as well.
pbronez•8mo ago
Is there a more current post with similar information? I’d love to see how contemporary tokenizers have improved on these algorithms.
anonymoushn•8mo ago
They have not.
woodson•8mo ago
There has been some innovation, e.g. SuperBPE (https://arxiv.org/abs/2503.13423) claims substantial gains on 8B-sized models.
yuvalpinter•8mo ago
A few years ago we released SaGe, which is a contextual tokenizer, meaning that it builds a vocab that's fit for an LM use case because tokens are selected to appear within as clear a context set as possible in a corpus. https://github.com/MeLeLBGU/SaGe It does better than both BPE and UnigramLM in several benchmarks. https://aclanthology.org/2024.acl-short.73/
anonymoushn•8mo ago
Cool, I can have a look. By the way, I've filed a PR against the pathpiece repository correcting an error in the docs.

I have read this paper before, and now that I have looked more closely I am a bit disappointed that the paper does not specify precisely how the large initial vocabularies are constructed for the learners in the experiments that begin from a large initial vocabulary. I also notice that the same error I have proposed correcting in the docs is present in the paper in section 3.1.

Additional edit: Alright, I see that A.2.1 specifies that the "n-gram" initial vocab condition means the most frequent 2^18 n-grams for n in 1..L. This seems like not very many, and not weighting n-grams by "potential CTC reduction" (=length-1) for inclusion in this initial vocab seems to create a bias against long n-grams. I'm also sort of wary of these greedy learners, because many candidates for inclusion in the vocab are interdependent with other candidates (the marginal loss or gain of a candidate may be near 0 because of the presence of some other token which may not actually be optimal to include, or there may be groups of included tokens which cause other better groups to have near 0 marginal loss or gain), so I would really want to avoid being too strongly path-dependent in the learner. If I may hand wave a great deal, I think this sort of issue also arises if your objective function is not CTC.

cschmidt•8mo ago
Somehow I didn't get any notifications of your PR. Sorry about that. I'll take a look.
cschmidt•8mo ago
Co-author of the PathPiece paper here.

With regard to weighting the n-grams by length*frequency, I'm not sure it is clear that that would be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and hence, unigram produces longer tokens on average. It is generally considered that this is a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.

Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.

anonymoushn•8mo ago
I think that the ideal number of initial n-grams would be large enough that adding additional initial n-grams has no effect on the output, because I expect to not be very good at tuning two different knobs.
cschmidt•8mo ago
Regarding $O(n L^2)$ vs $O(n L)$, that was because we somewhat sloppily tend to use the term 'tokenization' for both training a tokenizer vocab, and for tokenizing a given document. In the paper, we tried to always call the latter one segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter $O(n L)$. I'll update the README to be more explicit about this.
anonymoushn•8mo ago
No, the segmentation algorithm you have implemented has runtime O(N*L^2). In particular, if you want to do a hash table lookup using a string key, that takes time proportional to the length of the string, not constant time.
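
To make the accounting concrete, here is a generic shortest-path segmentation sketch (not PathPiece's actual code). It does O(N * L) candidate lookups, and each lookup hashes a string of up to L characters, hence O(N * L^2) overall. It assumes every single character is in the vocab so a full segmentation always exists.

    def segment(text, vocab, max_token_len):
        # best[i] = fewest tokens covering text[:i]
        N = len(text)
        best = [0] + [None] * N
        back = [0] * (N + 1)
        for i in range(1, N + 1):
            for l in range(1, min(max_token_len, i) + 1):
                piece = text[i - l:i]  # building and hashing this string is O(L)...
                if best[i - l] is not None and piece in vocab:  # ...so the lookup is O(L), not O(1)
                    if best[i] is None or best[i - l] + 1 < best[i]:
                        best[i], back[i] = best[i - l] + 1, i - l
        # Backtrack to recover the token sequence.
        tokens, i = [], N
        while i > 0:
            tokens.append(text[back[i]:i])
            i = back[i]
        return tokens[::-1]
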
cschmidt•8mo ago
That's an interesting point. While you're correct, of course, it is so common to treat a hash table lookup as an O(1) operation that it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
janalsncm•8mo ago
Interesting recent research from Meta going in what I would call the opposite direction from an interpretability standpoint: https://arxiv.org/abs/2412.09871

Rather than splitting based on a predefined list you can check with a text editor, BLTs use a neural net to tokenize. So it will be pretty hard to debug.

YossarianFrPrez•8mo ago
This is a cool paper!

Basically, prior to feeding text to the tokenizer, people have split the text on whitespace. But whitespace isn't exactly a meaningful separator. By getting rid of this restriction as the tokenizer is 'learning', some of the tokens end up being 'by the way' or 'in the long run.' The researchers find that this makes the model much more efficient.

yuvalpinter•8mo ago
This can also be done within the tokenization framework, see our work here: https://arxiv.org/abs/2504.00178
woodson•8mo ago
How does this differ from SuperBPE, which seems to pursue a similar goal? https://arxiv.org/abs/2503.13423

Looks like parallel invention. (I’m not associated with the paper or its authors.)

anonymoushn•8mo ago
In SuperBPE, a fixed number of tokens are learned, and then the constraints of pretokenization are removed entirely, and then the remainder of the target vocab size is learned.

In Boundless BPE, no schedule must be chosen, because there is not any point at which the constraints of pretokenization are removed entirely. Instead, at any point in the learning process, merges between adjacent pretokens are permitted if the pretokens are each represented by a single token. There are some additional details about how the authors incorporate Picky BPE, which I will not try to repeat because I would probably get them wrong.
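
For illustration, a heavily simplified sketch of the SuperBPE-style two-stage schedule described above (toy whitespace pretokenization and greedy most-frequent-pair merges; not the actual SuperBPE or BoundlessBPE implementation):

    from collections import Counter

    def pair_counts(chunks):
        counts = Counter()
        for chunk in chunks:
            counts.update(zip(chunk, chunk[1:]))
        return counts

    def apply_merge(chunk, pair, new_sym):
        out, i = [], 0
        while i < len(chunk):
            if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(chunk[i])
                i += 1
        return out

    def train_two_stage_bpe(text, n_merges, stage1_merges):
        # Stage 1: ordinary BPE, merges confined to whitespace-delimited pretokens.
        chunks = [list(word) for word in text.split()]
        merges = []
        for step in range(n_merges):
            if step == stage1_merges:
                # Stage 2: drop the pretokenization constraint so later merges
                # may span word boundaries (via an explicit space symbol).
                flat = []
                for chunk in chunks:
                    flat.extend(chunk + [" "])
                chunks = [flat]
            counts = pair_counts(chunks)
            if not counts:
                break
            best = max(counts, key=counts.get)
            chunks = [apply_merge(c, best, best[0] + best[1]) for c in chunks]
            merges.append(best)
        return merges
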

cschmidt•8mo ago
Yes, they were concurrent work. (Co-author of BoundlessBPE here). A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
mrkramer•8mo ago
So LLMs hallucinate by design, which is of course bad for any serious work, but then again I was thinking about the situations in which hallucinations would actually be useful... let's say poetry writing, lyrics writing, etc., where the crazier the better, or generating business ideas, "crazy ideas succeed" type logic.
Lerc•8mo ago
While training something on TinyStories one of the early stage outputs was

"Once upon a time, in a small town, there was a huge town"

Which I thought was a brilliant start for an interesting story. Just that slight deviation from expectation provides an excellent scaffold for your imagination to take hold.

Of course the rest of the generation was utter rubbish, but if you could deviate from the standard on demand at key points, while keeping the majority internally consistent, then it would be great.

mrkramer•8mo ago
Poetry is full of figures of speech, which are words and phrases that intentionally differ from common logic and common knowledge, so LLMs can actually be probabilistic and accidental generators of figures of speech. Every piece of nonsense that LLMs generate can be interpreted as a potential metaphor or an oxymoron. The sentence that you mentioned sounds like a badass metaphor to me, "Once upon a time, in a small town, there was a huge town", or in other words: the town might be small in size but it has soul.
Lerc•8mo ago
I have been doing experiments along these lines. Initially I have avoided the issue with the semantics of words altogether because there is plenty of work to be done before you even get to the words themselves.

With BPE you get different encodings based upon punctuation and whitespace whenever it decides to pair the characters before and after a word into the word itself.

The tokenizer I have been using breaks words into components of:

   prefix character
   prefix character repeat count
   index of lowercase(word)
   index of capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps)
   suffix character
The encoder encodes the character before and after the word, double-encoding the information into the word's previous and next neighbors; the decoder decreases the prefix count by one if it matches the suffix of the previous word. Anything that doesn't match a known word or is weirdly capitalized gets character-encoded as a series of prefix-suffix characters with no word body. (A separate channel for chars would probably be better.)

Each channel gets its own embedding table, and the embeddings from all the channels are summed before being passed into the transformer.

Decode is just a separate last-stage layer translating the final embedding into channels. In hindsight it should have been a little more than that, because if the options for the next word were split between " YELLED!" and " whispered." my current system could theoretically produce " WHISPERED.". In practice it doesn't seem to do that much, but that means it's had to learn something to deal with it (I suspect by limiting variation). Adding a little smarts to the final decoding step would help: perhaps choose the word index first, then use the embedding of the matched word to filter the predicted embedding before calculating the other channels.
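
For illustration, a minimal PyTorch sketch of the summed-channel embedding described above (channel names, vocabulary sizes, and limits are my guesses, not the commenter's actual code):

    import torch
    import torch.nn as nn

    class ChannelEmbedding(nn.Module):
        def __init__(self, d_model, n_chars=256, n_words=8192, max_repeat=16, n_cap_modes=3):
            super().__init__()
            self.prefix_char = nn.Embedding(n_chars, d_model)
            self.prefix_repeat = nn.Embedding(max_repeat, d_model)
            self.word = nn.Embedding(n_words, d_model)          # index of lowercase(word)
            self.cap_mode = nn.Embedding(n_cap_modes, d_model)  # 0=all lower, 1=initial cap, 2=all caps
            self.suffix_char = nn.Embedding(n_chars, d_model)

        def forward(self, channels):
            # channels: dict of (batch, seq) integer tensors, one per channel;
            # the per-channel embeddings are summed into one input embedding.
            return (self.prefix_char(channels["prefix_char"])
                    + self.prefix_repeat(channels["prefix_repeat"])
                    + self.word(channels["word"])
                    + self.cap_mode(channels["cap_mode"])
                    + self.suffix_char(channels["suffix_char"]))
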

I have not yet done anything on breaking words themselves up. I have been using TinyStories for training, so there hasn't been a need for it with so few unique words. I have given it a lot of thought though, and I think I would contest the 'Gold standard' encodings. I think a word like 'nodular' should be encoded as something like 'nodule' '<of or relating to modifier> <minus e> ar'

It's a little hard to comprehend what this might look like for other languages, but I think there are probably insights to be had if you tried to make something that encoded English, Latin, and Kanji equally well.

I'd be curious to know the total number of homonyms across languages, too. Just a single signal to say 'this one is a known homonym' would probably be beneficial. If the total number is low enough, giving them their own token range might work too.