I agree that it would be more useful to compare vocabs of identical size.
e.g. https://github.com/karpathy/nanoGPT/blob/master/data/shakesp...
I have read this paper before, and now that I have looked more closely I am a bit disappointed that the paper does not specify precisely how the large initial vocabularies are constructed for the learners that begin with one. I also notice that the same error I have proposed correcting in the docs is present in the paper in section 3.1.
Additional edit: Alright, I see that A.2.1 specifies that the "n-gram" initial vocab condition means the most frequent 2^18 n-grams for n in 1..L. That seems like not very many, and not weighting n-grams by their potential CTC reduction (= length - 1) for inclusion in this initial vocab seems to create a bias against long n-grams. I'm also somewhat wary of these greedy learners, because candidates for inclusion in the vocab are interdependent: a candidate's marginal gain or loss may be near zero because of some other included token that isn't actually optimal to include, or a group of included tokens may push a better group's marginal gain toward zero. So I would really want to avoid a learner that is too strongly path-dependent. If I may hand-wave a great deal, I think this sort of issue also arises if your objective function is not CTC.
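To make the weighting point concrete, here is a toy sketch (my own hypothetical code, not the paper's actual procedure) of selecting seed n-grams by raw frequency versus by (length - 1) * frequency, i.e. the potential CTC reduction per candidate:

    from collections import Counter

    def seed_ngrams(seq, max_len, k, weight_by_savings=False):
        # Toy seed-vocab selection over a sequence of characters or pre-tokens.
        # weight_by_savings=False ranks candidates by raw frequency (my reading
        # of A.2.1); weight_by_savings=True ranks by (len - 1) * frequency,
        # i.e. the corpus token count saved if every occurrence used this n-gram.
        counts = Counter(
            tuple(seq[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)
        )

        def score(ng):
            return (len(ng) - 1) * counts[ng] if weight_by_savings else counts[ng]

        return sorted(counts, key=score, reverse=True)[:k]

Under the raw-frequency ranking, every long n-gram ranks below all of its own substrings, which are necessarily at least as frequent; that is the bias I mean.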
With regard to weighting the n-grams by length*frequency, I'm not sure that would actually be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and as a result unigram produces longer tokens on average. That is generally considered a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.
Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.
Rather than splitting based on a predefined list you can check with a text editor, BLTs use a neural net to tokenize. So it will be pretty hard to debug.
Basically, prior to feeding text to the tokenizer, people have split the text on whitespace. But whitespace isn't exactly a meaningful separator. By getting rid of this restriction while the tokenizer is 'learning', some of the tokens end up being 'by the way' or 'in the long run.' The researchers find that this makes the model much more efficient.
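Roughly (my own illustration, not the paper's code), the conventional pipeline pre-splits like this, and dropping that step is what lets multi-word strings become merge candidates:

    import re

    text = "by the way, in the long run this pays off"

    # Conventional pipeline: pre-split on whitespace, so no learned token can
    # ever span a word boundary and "by the way" can never become one token.
    pretokens = re.findall(r"\S+\s*", text)
    print(pretokens)

    # Without the pre-split the learner sees the raw stream, so frequent
    # multi-word sequences like "by the way" become legitimate candidates.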
"Once upon a time, in a small town, there was a huge town"
Which I thought was a brilliant start for an interesting story. Just that slight deviation from expectation provides an excellent scaffold for your imagination to take hold.
Of course the rest of the generation was utter rubbish, but if you could deviate from the standard on demand at key points while keeping the majority internally consistent, it would be great.
With BPE you get different encodings depending on punctuation and whitespace, whenever it decides to merge the characters before and after a word into the word itself.
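You can see this quickly with an off-the-shelf BPE (this uses tiktoken's GPT-2 encoding as an example, nothing to do with my tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for s in ["yelled", " yelled", "yelled!", " Yelled."]:
        ids = enc.encode(s)
        print(repr(s), ids, [enc.decode([i]) for i in ids])
    # The same surface word splits differently depending on the leading
    # space, trailing punctuation, and capitalization around it.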
The tokenizer I have been using breaks words into components of:
prefix character
prefix character repeat count
index of lowercase(word)
index of capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps)
suffix character
The encoder encodes the character before and after the word, double-encoding that information into the previous and next words; the decoder decreases the prefix count by one if it matches the suffix of the previous word. Anything that doesn't match a known word or is weirdly capitalized gets character-encoded as a series of prefix-suffix characters with no word body (a separate channel for chars would probably be better). Each channel gets its own embedding table, and the embeddings from all the channels are summed before being passed into the transformer.
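A simplified sketch of the word-to-channels step as described (hypothetical names, and with the character fallback omitted):

    def word_to_channels(prev_text, word, next_char, vocab):
        # Channels: prefix char, prefix repeat count, index of lowercase(word),
        # capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps),
        # suffix char. Each channel has its own embedding table and the
        # per-channel embeddings are summed before the transformer.
        prefix = prev_text[-1] if prev_text else " "
        repeat = len(prev_text) - len(prev_text.rstrip(prefix))  # run length of the prefix char
        lower = word.lower()
        if lower not in vocab or not (word.islower() or word.istitle() or word.isupper()):
            return None  # the real system falls back to prefix-suffix character encoding
        cap = 0 if word.islower() else (1 if word.istitle() else 2)
        return (prefix, repeat, vocab[lower], cap, next_char)

    vocab = {"yelled": 0, "whispered": 1}
    print(word_to_channels("He ", "YELLED", "!", vocab))  # (' ', 1, 0, 2, '!')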
Decode is just a separate last-stage layer translating the final embedding into channels. In hindsight the decode should have been a little more than that, because if the options for the next word were split between " YELLED!" and " whispered." my current system could theoretically produce " WHISPERED.". In practice it doesn't seem to do that much, but that means it has had to learn something to deal with it (I suspect by limiting variation). Adding a little smarts to the final tokenization step would help: perhaps choose the word index first, then use the embedding of the matched word to filter the predicted embedding before calculating the other channels.
I have not yet done anything on breaking words themselves up. I have been using TinyStories for training, so with so few unique words there hasn't been a need for it. I have given it a lot of thought though, and I think I would contest the 'Gold standard' encodings. I think a word like 'nodular' should be encoded as something like 'nodule' '<of or relating to modifier> <minus e> ar'
It's a little hard to imagine what this might look like for other languages, but I think there are probably insights to be had if you tried to make something that encoded English, Latin and Kanji equally well.
I'd be curious to know the total number of homonyms across languages, too. Just a single signal to say 'this one is a known homonym' would probably be beneficial. If the total number is low enough, giving them their own token range might work too.
anonymoushn•1d ago
mcyc•1d ago
BPE builds vocabularies from the base up, so I assume you are talking about Unigram, which starts with a big vocabulary and trims it.
The details of UnigramLM are here https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is Section 3.2.
Basically, it just selects all substrings that appear in the corpus up to a certain length (and then maybe trims it a little by discarding rare substrings or something to reduce the initial size a bit and make things faster).
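In very rough Python (my sketch of the idea, not what SentencePiece actually does internally), the seeding step amounts to something like:

    from collections import Counter
    from heapq import nlargest

    def seed_unigram_vocab(pretoken_counts, max_len=8, seed_size=1_000_000):
        # Count every substring up to max_len (within each pre-token here,
        # which may or may not match the real implementation) and keep the
        # seed_size most frequent as the oversized initial vocabulary.
        # SentencePiece uses an enhanced suffix array instead of brute force.
        counts = Counter()
        for tok, freq in pretoken_counts.items():
            for i in range(len(tok)):
                for n in range(1, min(max_len, len(tok) - i) + 1):
                    counts[tok[i:i + n]] += freq
        return dict(nlargest(seed_size, counts.items(), key=lambda kv: kv[1]))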
anonymoushn•1d ago
Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array algorithm isn't a top-k algorithm.
cschmidt•2h ago
See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...
I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.