I agree that it would be more useful to compare vocabs of identical size.
e.g. https://github.com/karpathy/nanoGPT/blob/master/data/shakesp...
I have read this paper before, and now that I have looked more closely I am a bit disappointed that the paper does not specify precisely how the large initial vocabularies are constructed for the learners that begin with one. I also notice that the same error I have proposed correcting in the docs is present in the paper in section 3.1.
Additional edit: Alright, I see that A.2.1 specifies that the "n-gram" initial vocab condition means the most frequent 2^18 n-grams for n in 1..L. That seems like not very many, and not weighting n-grams by their potential CTC reduction (= length - 1) for inclusion in this initial vocab seems to create a bias against long n-grams. I'm also somewhat wary of these greedy learners, because candidates for inclusion in the vocab are interdependent: a candidate's marginal gain or loss may be near zero because of some other included token that isn't actually optimal to include, or a group of included tokens may push a better group's marginal gain toward zero. So I would really want to avoid a learner that is too strongly path-dependent. If I may hand-wave a great deal, I think this sort of issue also arises if your objective function is not CTC.
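To make the weighting point concrete, here is a toy sketch (my own hypothetical code, not the paper's actual procedure) of selecting seed n-grams by raw frequency versus by (length - 1) * frequency, i.e. the potential CTC reduction per candidate:

    from collections import Counter

    def seed_ngrams(seq, max_len, k, weight_by_savings=False):
        # Toy seed-vocab selection over a sequence of characters or pre-tokens.
        # weight_by_savings=False ranks candidates by raw frequency (my reading
        # of A.2.1); weight_by_savings=True ranks by (len - 1) * frequency,
        # i.e. the corpus token count saved if every occurrence used this n-gram.
        counts = Counter(
            tuple(seq[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)
        )

        def score(ng):
            return (len(ng) - 1) * counts[ng] if weight_by_savings else counts[ng]

        return sorted(counts, key=score, reverse=True)[:k]

Under the raw-frequency ranking, every long n-gram ranks below all of its own substrings, which are necessarily at least as frequent; that is the bias I mean.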
With regard to weighting the n-grams by length*frequency, I'm not sure that would actually be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and as a result unigram produces longer tokens on average. That is generally considered a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.
Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.
Rather than splitting based on a predefined list you can check with a text editor, BLTs use a neural net to tokenize. So it will be pretty hard to debug.
Basically, prior to feeding text to the tokenizer, people have split the text on whitespace. But whitespace isn't exactly a meaningful separator. By getting rid of this restriction while the tokenizer is 'learning', some of the tokens end up being 'by the way' or 'in the long run.' The researchers find that this makes the model much more efficient.
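Roughly (my own illustration, not the paper's code), the conventional pipeline pre-splits like this, and dropping that step is what lets multi-word strings become merge candidates:

    import re

    text = "by the way, in the long run this pays off"

    # Conventional pipeline: pre-split on whitespace, so no learned token can
    # ever span a word boundary and "by the way" can never become one token.
    pretokens = re.findall(r"\S+\s*", text)
    print(pretokens)

    # Without the pre-split the learner sees the raw stream, so frequent
    # multi-word sequences like "by the way" become legitimate candidates.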
"Once upon a time, in a small town, there was a huge town"
Which I thought was a brilliant start for an interesting story. Just that slight deviation from expectation provides an excellent scaffold for your imagination to take hold.
Of course the rest of the generation was utter rubbish, but if you could deviate from the standard on demand at key points while keeping the majority internally consistent, it would be great.
With BPE you get different encodings depending on punctuation and whitespace, whenever it decides to merge the characters before and after a word into the word itself.
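You can see this quickly with an off-the-shelf BPE (this uses tiktoken's GPT-2 encoding as an example, nothing to do with my tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for s in ["yelled", " yelled", "yelled!", " Yelled."]:
        ids = enc.encode(s)
        print(repr(s), ids, [enc.decode([i]) for i in ids])
    # The same surface word splits differently depending on the leading
    # space, trailing punctuation, and capitalization around it.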
The tokenizer I have been using breaks words into components of:
prefix character
prefix character repeat count
index of lowercase(word)
index of capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps)
suffix character
The encoder encodes the character before and after the word, double-encoding that information into the previous and next words; the decoder decreases the prefix count by one if it matches the suffix of the previous word. Anything that doesn't match a known word or is weirdly capitalized gets character-encoded as a series of prefix-suffix characters with no word body (a separate channel for chars would probably be better). Each channel gets its own embedding table, and the embeddings from all the channels are summed before being passed into the transformer.
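A simplified sketch of the word-to-channels step as described (hypothetical names, and with the character fallback omitted):

    def word_to_channels(prev_text, word, next_char, vocab):
        # Channels: prefix char, prefix repeat count, index of lowercase(word),
        # capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps),
        # suffix char. Each channel has its own embedding table and the
        # per-channel embeddings are summed before the transformer.
        prefix = prev_text[-1] if prev_text else " "
        repeat = len(prev_text) - len(prev_text.rstrip(prefix))  # run length of the prefix char
        lower = word.lower()
        if lower not in vocab or not (word.islower() or word.istitle() or word.isupper()):
            return None  # the real system falls back to prefix-suffix character encoding
        cap = 0 if word.islower() else (1 if word.istitle() else 2)
        return (prefix, repeat, vocab[lower], cap, next_char)

    vocab = {"yelled": 0, "whispered": 1}
    print(word_to_channels("He ", "YELLED", "!", vocab))  # (' ', 1, 0, 2, '!')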
Decode is just a separate last-stage layer translating the final embedding into channels. In hindsight the decode should have been a little more than that, because if the options for the next word were split between " YELLED!" and " whispered." my current system could theoretically produce " WHISPERED.". In practice it doesn't seem to do that much, but that means it has had to learn something to deal with it (I suspect by limiting variation). Adding a little smarts to the final tokenization step would help: perhaps choose the word index first, then use the embedding of the matched word to filter the predicted embedding before calculating the other channels.
I have not yet done anything on breaking words themselves up. I have been using TinyStories for training, so with so few unique words there hasn't been a need for it. I have given it a lot of thought though, and I think I would contest the 'Gold standard' encodings. I think a word like 'nodular' should be encoded as something like 'nodule' '<of or relating to modifier> <minus e> ar'
It's a little hard to imagine what this might look like for other languages, but I think there are probably insights to be had if you tried to make something that encoded English, Latin and Kanji equally well.
I'd be curious to know the total number of homonyms across languages, too. Just a single signal to say 'this one is a known homonym' would probably be beneficial. If the total number is low enough, giving them their own token range might work too.
anonymoushn•1d ago
mcyc•1d ago
BPE builds vocabularies from the base up, so I assume you are talking about Unigram, which starts with a big vocabulary and trims it.
The details of UnigramLM are here https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is Section 3.2.
Basically, it just selects all substrings that appear in the corpus up to a certain length (and then maybe trims it a little by discarding rare substrings or something to reduce the initial size a bit and make things faster).
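In very rough Python (my sketch of the idea, not what SentencePiece actually does internally), the seeding step amounts to something like:

    from collections import Counter
    from heapq import nlargest

    def seed_unigram_vocab(pretoken_counts, max_len=8, seed_size=1_000_000):
        # Count every substring up to max_len (within each pre-token here,
        # which may or may not match the real implementation) and keep the
        # seed_size most frequent as the oversized initial vocabulary.
        # SentencePiece uses an enhanced suffix array instead of brute force.
        counts = Counter()
        for tok, freq in pretoken_counts.items():
            for i in range(len(tok)):
                for n in range(1, min(max_len, len(tok) - i) + 1):
                    counts[tok[i:i + n]] += freq
        return dict(nlargest(seed_size, counts.items(), key=lambda kv: kv[1]))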
anonymoushn•1d ago
Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array algorithm isn't a top-k algorithm.
cschmidt•2h ago
See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...
I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.