From text to token: How tokenization pipelines work

https://www.paradedb.com/blog/when-tokenization-becomes-token

134•philippemnoel•1mo ago

Comments

wongarsu•1mo ago

Notably, this is tokenization for traditional search. LLMs use very different tokenization with very different goals.

tgv•1mo ago

It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with a richer morphology, or compounding. It's a very "English" approach.

jamesgresql•1mo ago

Chinese, Japanese, Korean etc. don’t work like this either.

However, even though the approach is “old fashioned” it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate?

At the end of the day people choose a tokenizer that matches their language.

I will update the article to make all this clearer though!

empiko•1mo ago

This was common even in 2015. You can still see people removing stop words from text, even when they feed it to LLMs. It's of course terrible for performance, but old habits die hard I guess.

jamesgresql•1mo ago

100%, maybe we should do a follow up on other types of tokenization.

semicognitive•1mo ago

ParadeDB is a great team, highly recommend using

the_arun•1mo ago

Just curious - if we remove stop words from prompts before sending them to an LLM, wouldn't that reduce the token count? Would the LLM's response stay the same (original prompt vs. prompt without stop words)?

kylecazar•1mo ago

Search engines can afford to throw out stopwords because they're often keyword based. But (frontier) LLMs need the nuance and semantics they signal -- they don't automatically strip them. There are probably special purpose models that do this, or in certain parts of a RAG pipeline, but that's the exception.

Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though. You're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is going to save you negligible $ and potentially cost a lot in model performance.
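
As a rough illustration of the trade-off described above, here is a minimal sketch that strips stop words from a prompt. The stop list and the `strip_stop_words` helper are hypothetical, made up for this example; real stop lists vary by library, though many do include negations such as "not".

```python
# Hypothetical stop list for illustration only; real lists differ by library.
STOP_WORDS = {"a", "an", "the", "do", "not", "to", "of", "is", "if", "it"}


def strip_stop_words(prompt: str) -> str:
    """Drop every whitespace-separated word found in STOP_WORDS (case-insensitive)."""
    kept = [w for w in prompt.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)


original = "Do not delete the file if it is locked"
stripped = strip_stop_words(original)
print(stripped)  # delete file locked
```

The stripped prompt is six words shorter, but the negation is gone, so the instruction now reads as the opposite of what was asked.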

cubefox•1mo ago

Don't know, but GPT-5 Thinking strips out a lot of words in its reasoning trace in order to save tokens. Someone on Twitter jailbroke it in order to get the original CoT traces.

gortok•1mo ago

My biggest complaints about search come from day-to-day uses:

I use search in my email pretty heavily, and I’m most interested in specific words in the email, and in emails from specific folks or a specific domain. But the mobile version of Gmail produces different results than the mobile Outlook app, which in turn differs from the desktop version of Gmail, and all of them are pretty terrible at email search.

I have a hard time getting them to pull up emails in search that I know exist, that I know have certain words, and that I know have certain email addresses in the body.

I recognize a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?

mattnewton•1mo ago

Huh, maybe your use case is around the indexing of the contents of attachments? I basically never search for the contents of attachments, just the bodies of emails, and have found Gmail search to be really good. I switched back to the web client from Mac’s native mail app for this reason because search has been so good for me in Gmail.

I haven’t looked, but I wonder if there is a good hackable email client that will let you substitute out the search index with a reasonable abstraction from all the complicated email protocol stuff. I feel like building an index for your use case is totally achievable if so.

heikkilevanto•1mo ago

Good explanation on tokenizing English text for regular search. But it is far from universal, and will not work well in Finnish, for example.

Folding diacritics makes "vähä" (little) into "vaha" (wax).

Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).

Stemming Finnish words is also much more complex, as we tend to append suffixes to the words instead of putting small words in front of the word. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" makes it a question "from my house?"

If that sounds too easy, consider Japanese. From what little I know they don't use whitespace to separate words, mix two phonetic alphabets with Chinese ideograms, etc.
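
The diacritic-folding collision mentioned above is easy to reproduce. The sketch below assumes folding is done by Unicode decomposition followed by dropping combining marks, which is one common approach and not necessarily what any particular search engine does; `fold_diacritics` is a name made up for this example.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (for example, ä becomes a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("vähä"))  # vaha
print(fold_diacritics("vähä") == fold_diacritics("vaha"))  # True
```

After folding, "vähä" (little) and "vaha" (wax) become the same index term, so a query for one also matches the other.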

philippemnoel•1mo ago

That's true. For this reason, most modern search engines support language-aware stemming and tokenization. Popular tokenizers for CJK languages include Lindera and Jieba.

We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...

ashirviskas•1mo ago

Yep and I find that this really worsens LLM performance. For example `Ben,Alice` would be tokenized as `Ben|,A|lice`. And having to connect `lice` to the name `Alice` does not make it any easier for LLMs. However, formatting it as `Ben, Alice` tokenizes it as `Ben|,| Alice`. I found it kind of useful to improve performance by just formatting the data a bit differently.

I actually just started working on a data formatter that applies principles like these to drastically reduce the amount of tokens without decreasing the performance, like other formats do (looking at you, tson).
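
The reformatting described in this comment could look something like the sketch below. It only rewrites the text; it does not call a tokenizer, and whether the extra space actually changes token boundaries depends on the tokenizer in use. `space_after_commas` is a hypothetical helper, not part of the formatter mentioned above.

```python
import re


def space_after_commas(record: str) -> str:
    """Insert one space after any comma directly followed by a non-space character."""
    return re.sub(r",(?=\S)", ", ", record)


print(space_after_commas("Ben,Alice"))   # Ben, Alice
print(space_after_commas("Ben, Alice"))  # Ben, Alice (already spaced, unchanged)
```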

nawazgafar•1mo ago

You beat me to the punch. I wrote a blog post[1] with the exact same title last week! Though, I went into a bit more detail with regard to embedding layers, so maybe my title is not accurate.

1. https://gafar.org/blog/text-to-tokens

jamesgresql•1mo ago

Amazing, will have a read!

flakiness•1mo ago

Oh it's good old tokenization vs for-LLM tokenizers like SentencePiece or tiktoken. We shouldn't forget there are non-ML simple things like this one that don't ask you to buy more GPUs.

jamesgresql•1mo ago

Haha, I like “good old tokenization”

6r17•1mo ago

I'm wondering if the English stopwords are not remnants of a declension system that was lost from the language... OK, so I had to check this out, but I don't really have time to check with more than Gemini. Apparently the word "the" is basically the sole survivor of a massive, complex table of declensions. In Old English, you could not just say "the." You had to choose the correct word based on gender, case, and number, exactly like you do in Polish today with ten, ta, to, tego, temu, tej, etc.

The Old English "The" (Definite Article)

Case          Masculine (Ten)  Neuter (To)  Feminine (Ta)  Plural (Te)
Nominative    Se               Þæt          Sēo            Þā
Accusative    Þone             Þæt          Þā             Þā
Genitive      Þæs              Þæs          Þære           Þāra
Dative        Þæm              Þæm          Þære           Þæm
Instrumental  Þy               Þy           —              —

I have read somewhere that Polish is actually a more precise language to use with AI. I'm wondering whether shortening words that apparently make no sense is actually hurting it more, as the article notes.

So I wonder at this point: wouldn't it be worth exploring a tenser version of the language that might bridge that gap? Completely exploratory, though; I don't even know if that would be a helpful idea other than as a toy.

zk0•1mo ago

tl;dr with a match statement

anonymoushn•1mo ago

i love stemming, i love searching for "anime" and getting "animal"
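
A toy sketch of how this kind of collision happens. The suffix list below is hypothetical and deliberately naive; it is not the Porter or Snowball rule set, though those can conflate unrelated words in a similar way.

```python
# Hypothetical suffix list, tried in order; not the Porter or Snowball rules.
SUFFIXES = ("ation", "ing", "al", "es", "e", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(naive_stem("anime"))   # anim
print(naive_stem("animal"))  # anim
```

Both words reduce to the same stem, so an index built on stems cannot tell them apart.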
