Benchmark (enwik9, 1 GB):
| Tokenizer | Vocab Size | Total Tokens | Ratio (bytes/token) | Throughput |
|---|---|---|---|---|
| GreedyPhrase | 65,536 | 222,805,405 | 4.49x | 47 MB/s |
| Tiktoken o200k_base (GPT-4o) | 200,019 | 270,616,861 | 3.70x | 4.35 MB/s |
| Tiktoken cl100k_base (GPT-4) | 100,277 | 273,662,103 | 3.65x | 7.13 MB/s |
GreedyPhrase compresses 1.23x better than GPT-4's cl100k_base (273,662,103 / 222,805,405 ≈ 1.23) and 1.21x better than GPT-4o's o200k_base, with a 1.5-3x smaller vocabulary and 6-11x higher encoding throughput.
How It Works:
1. Phrase Mining — Split the input into atoms (words, punctuation, whitespace runs), mine frequent bigrams and trigrams over those atoms, and fill 95% of the vocabulary slots with the top phrases (first sketch below).
2. BPE Fallback — Train BPE on the residual byte sequences the mined phrases do not cover; the merges fill the remaining 5% of the vocabulary (second sketch below).
3. Greedy Encoding — Encode with longest-match-first lookups in a trie, falling back to single bytes for anything unmatched, so there is zero OOV (third sketch below).
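A minimal sketch of the phrase-mining step, assuming a simple regex atomizer and plain adjacent n-gram counting; `mine_phrases`, `ATOM_RE`, and the 95% `phrase_share` split are illustrative names, not GreedyPhrase's actual API:

```python
import re
from collections import Counter

# Words, whitespace runs, and single punctuation marks as atoms.
ATOM_RE = re.compile(r"\w+|\s+|[^\w\s]")

def mine_phrases(text: str, vocab_size: int = 65_536,
                 phrase_share: float = 0.95) -> list[str]:
    atoms = ATOM_RE.findall(text)
    counts: Counter[str] = Counter()
    # Count bigrams and trigrams of adjacent atoms.
    for n in (2, 3):
        for i in range(len(atoms) - n + 1):
            counts["".join(atoms[i:i + n])] += 1
    budget = int(vocab_size * phrase_share)  # ~95% of slots go to phrases
    return [phrase for phrase, _ in counts.most_common(budget)]

print(mine_phrases("the cat sat on the mat. the cat sat.", vocab_size=20)[:3])
```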
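A minimal sketch of the BPE fallback, assuming the residuals are the byte sequences left uncovered by the mined phrases; `train_bpe` and its merge loop are an assumed, textbook-style trainer, not the project's own:

```python
from collections import Counter

def train_bpe(residuals: list[bytes], n_merges: int) -> list[tuple[bytes, bytes]]:
    # Each residual starts as a sequence of single-byte tokens.
    seqs = [[bytes([b]) for b in r] for r in residuals]
    merges: list[tuple[bytes, bytes]] = []
    for _ in range(n_merges):
        pairs: Counter[tuple[bytes, bytes]] = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the chosen merge to every sequence.
        for s, seq in enumerate(seqs):
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seqs[s] = out
    return merges

# e.g. the remaining ~5% of a 65,536-slot vocab is ~3,276 merge tokens:
merges = train_bpe([b"unseenword", b"unseenword", b"another"], n_merges=8)
print(merges[:3])
```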
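A minimal sketch of greedy longest-match encoding over a byte trie with single-byte fallback; the trie layout (a `None` key as terminal marker) and the reserved byte-id scheme are assumptions for illustration:

```python
def build_trie(vocab: dict[bytes, int]) -> dict:
    root: dict = {}
    for token, tid in vocab.items():
        node = root
        for b in token:
            node = node.setdefault(b, {})
        node[None] = tid  # terminal marker holds the token id
    return root

def encode(data: bytes, trie: dict, byte_ids: list[int]) -> list[int]:
    ids, i = [], 0
    while i < len(data):
        node, match_id, match_len = trie, None, 0
        # Walk as deep as the trie allows; remember the longest terminal seen.
        for j in range(i, len(data)):
            if data[j] not in node:
                break
            node = node[data[j]]
            if None in node:
                match_id, match_len = node[None], j - i + 1
        if match_id is not None:
            ids.append(match_id)
            i += match_len
        else:
            ids.append(byte_ids[data[i]])  # byte fallback: zero OOV
            i += 1
    return ids

vocab = {b"the ": 0, b"cat": 1}
byte_ids = [256 + b for b in range(256)]  # ids reserved for raw bytes
print(encode(b"the cat!", build_trie(vocab), byte_ids))  # [0, 1, 289]
```

Because every possible byte has a reserved id, any input encodes without failure, which is what makes the scheme zero-OOV.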