> The model never sees text. It sees a sequence of integer indices into its own private alphabet.
> So tokens aren’t “roughly like words” or “kind of like characters”. They’re the atoms of perception for one specific model, and they’re the only language that model speaks.
> The same sentence is nine tokens to GPT-4 and seven tokens to Llama 3. Not because Llama is smarter or the sentence changed, but because the two models have different vocabularies.
> That’s it. No clever scoring, no neural network.
Could people who use LLMs to write articles at least prompt them to have a better style? I'm really tired of the default Claude style (a lot of Chinese models also reuse the same style).
s1monb•1h ago
I appreciate the feedback. My main focus was on the visual elements, and not so much "ridding the text of AI-traces".
What did you think about the more visual elements?
Simon
s1monb•58m ago
I will do better and link to the research and related sources in the next iteration.
Tiberium•56m ago
I was just pointing out how the article is clearly LLM-written, probably including the interactive widgets. It's especially obvious because someone writing such an article in 2026 would at least find what the newest tokenizers are, instead of mentioning LLaMA 2/3 (!) and GPT's old tokenizer, which they dropped as of GPT-4o (or something close).
And, more obviously, the fact that GPT-4 is being directly named even though that model is over 3 years old by now: "Ask GPT-4, Claude, or Gemini today and they will usually answer three."
Sorry, I just think that the article wasn't produced by a human at all.
s1monb•36m ago
> It's especially obvious because someone writing such an article in 2026 would at least find what the newest tokenizers are
The underlying BPE algorithm, which is the main focus of this article, is the one used by modern tokenizers today.
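For readers who haven't seen it, here is a minimal sketch of BPE training on a toy corpus. The function names (`train_bpe`, `merge_pair`, `encode`) are illustrative and not taken from the article or from any library, and it works on characters of whole words, whereas production tokenizers typically work on bytes and add pre-tokenization rules.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of single characters with a count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the pair counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=3)
tokens = encode("lowly", merges)
```

On this toy corpus the learned merges build up the symbol `low`, so `encode("lowly", merges)` splits the unseen word into `low`, `l`, `y`. The whole procedure is pair counting and string merging; a different training corpus or merge budget gives a different vocabulary.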
> The fact that GPT-4 is being directly named even though that model is over 3 years old by now
That is fair. Will be updated.
> Sorry, I just think that the article wasn't produced by a human at all.
While I have used an LLM to help me write and explain my content, my hope is that most readers do not share this opinion of yours. Not everything touched by AI is slop, and I wanted to share the notes I created for myself.
Tiberium•1h ago
/s
Tiberium•1h ago
> The chunks aren’t characters and they aren’t words. They’re something more specific, and the specificity matters more than most people realize.
> Those numbers are real, but they hide what a token actually is.
> GPT-4’s vocabulary isn’t Claude’s. Claude’s isn’t Llama’s.