frontpage.

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
233•theblazehen•2d ago•68 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
695•klaussilveira•15h ago•206 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
7•AlexeyBrin•1h ago•0 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
962•xnx•20h ago•555 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
130•matheusalmeida•2d ago•35 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
67•videotopia•4d ago•6 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
54•jesperordrup•5h ago•25 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
11•matt_d•3d ago•2 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
37•kaonwarb•3d ago•27 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
236•isitcontent•15h ago•26 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
234•dmpetrov•16h ago•125 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
33•speckx•3d ago•21 comments

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
12•__natty__•3h ago•0 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
335•vecti•17h ago•147 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
502•todsacerdoti•23h ago•244 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
386•ostacke•21h ago•97 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
300•eljojo•18h ago•186 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•185 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
425•lstoll•21h ago•282 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
68•kmm•5d ago•10 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
96•quibono•4d ago•22 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
21•bikenaga•3d ago•11 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
19•1vuio0pswjnm7•1h ago•5 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
265•i5heu•18h ago•217 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
33•romes•4d ago•3 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
64•gfortaine•13h ago•28 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1077•cdrnsf•1d ago•460 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
39•gmays•10h ago•13 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
298•surprisetalk•3d ago•44 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
154•vmatsiiako•20h ago•72 comments

Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)

https://ndingwall.github.io/blog/tokenization
80•phewlink•8mo ago

Comments

anonymoushn•8mo ago
How does SentencePiece choose the initial vocabulary, which is trimmed down to determine the final vocabulary which has these desirable properties?
mcyc•8mo ago
Just a minor nit: SentencePiece is a library, not a tokenization algorithm. It implements two tokenization algorithms, Unigram and BPE.

BPE builds vocabularies from the bottom up, so I assume you are talking about Unigram, which starts with a big vocabulary and trims it down.

The details of UnigramLM are here https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is Section 3.2.

Basically, it just selects all substrings that appear in the corpus up to a certain length (and then maybe trims it a little by discarding rare substrings or something to reduce the initial size a bit and make things faster).
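
For illustration, a brute-force sketch of that seeding step (not SentencePiece's actual implementation, which enumerates frequent substrings with an enhanced suffix array rather than counting everything):

    from collections import Counter

    def seed_vocab(corpus, max_len=16, seed_size=1_000_000):
        # Enumerate every substring up to max_len characters and count it.
        counts = Counter()
        for text in corpus:
            for i in range(len(text)):
                for j in range(i + 1, min(i + max_len, len(text)) + 1):
                    counts[text[i:j]] += 1
        # Keep the highest-scoring candidates; scoring by count * length here,
        # per the discussion below about how the seed n-grams are weighted.
        scored = sorted(counts, key=lambda s: counts[s] * len(s), reverse=True)
        return scored[:seed_size]
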

anonymoushn•8mo ago
If the library has two vocabulary learners, only one of which does the described thing, then isn't it unambiguous which implementation within the library the question refers to? And wouldn't it be ambiguous to instead say "how does Unigram do it" without referring to any particular implementation?

Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array algorithm isn't a top-k algorithm.

cschmidt•8mo ago
It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods.

See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...

I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.

mcyc•8mo ago
You can cross whitespace boundaries by setting flag `--split-on-whitespace` to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...

cschmidt•8mo ago
For anyone reading this in the future: I meant to say the length weighting is a bit nonstandard. It is usually by frequency alone. Oops
wejick•8mo ago
The vibe of AI discussions back then was so different from today's (or at least from the past 2.5 years). It's quite surreal how fast things have moved.
blurbleblurble•8mo ago
It's always seemed like such low hanging fruit, so much semantic information just squandered in the thirst for larger models.
kevmo314•8mo ago
Unfortunately the choice of tokenizer is baked into the model. If you want to innovate on the tokenizer, you have to train a whole base model yourself to prove it's better which makes the barrier for innovation pretty high.
mdaniel•8mo ago
Apologies if this is a dumb question, but is there no "hello world"-ish sandbox for testing this theory? I can very easily imagine that trying to go head-to-head with R1 or such is going to take a boatload of GPU, but for just testing tokenizers head-to-head, isn't there a smaller-sized model that can be used in a bake-off?
anonymoushn•8mo ago
You can use modded-nanogpt for testing this sort of change. If you don't want to do anything really weird, you can just train whatever tokenizer you like on the training set, then retokenize the input, then run the existing training code. One person did this earlier this year using a Tokenmonster vocab and got better downstream performance in less train time.
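
For illustration, a minimal sketch of that workflow using SentencePiece to train a BPE vocab and re-encode the splits into the flat uint16 token files that nanoGPT-style training scripts read (file names and vocab size are placeholders):

    import numpy as np
    import sentencepiece as spm

    # Train your own tokenizer on the training text only.
    spm.SentencePieceTrainer.train(
        input="train.txt", model_prefix="custom", vocab_size=32_000, model_type="bpe"
    )
    sp = spm.SentencePieceProcessor(model_file="custom.model")

    # Retokenize both splits and dump them in the format the training code expects.
    for split in ("train", "val"):
        with open(f"{split}.txt", encoding="utf-8") as f:
            ids = sp.encode(f.read())
        np.array(ids, dtype=np.uint16).tofile(f"{split}.bin")
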
pama•8mo ago
Which exact attempt, with which Tokenmonster vocab, are you referring to? Sometimes it is hard to conclude much from these efforts. For example, having a smaller vocabulary is typically only useful for small models, where the compute cost of the softmax layer at the end of the decoder may still factor into the performance equation. Fixing the size of the vocabulary while increasing the rest of the model makes this inefficiency disappear.
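
As a rough illustration of that point (all numbers hypothetical; using the common 12 * n_layers * d_model^2 approximation for the transformer's weight count):

    def softmax_share(d_model, n_layers, vocab_size):
        # Approximate share of parameters sitting in the output vocabulary projection.
        transformer = 12 * n_layers * d_model ** 2   # attention + MLP weights
        output_matrix = d_model * vocab_size         # final projection onto the vocab
        return output_matrix / (transformer + output_matrix)

    print(softmax_share(d_model=768,  n_layers=12, vocab_size=50_000))  # ~0.31 for a GPT-2-small-sized config
    print(softmax_share(d_model=8192, n_layers=80, vocab_size=50_000))  # ~0.006 for a much larger config
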
anonymoushn•8mo ago
https://x.com/alexjc/status/1881410039639863622

I agree that it would be more useful to compare vocabs of identical size.

pama•8mo ago
Thanks! It seems to me that the performance gain here was due to a smaller vocab size. This type of change is almost guaranteed to backfire for larger models and larger datasets / lower loss requirements, so it probably is not very useful. Generally the historical trend has been to use larger vocabularies as the models got larger themselves.
anonymoushn•8mo ago
Well, he says he achieved some downstream win on the same size, but it didn't translate into a win in perplexity, so he tried to do something else. Like I said, it's unfortunate.
anonymoushn•8mo ago
I actually wonder if he could just claim a win by calculating validation set BPB for both equally-sized vocabs instead of targeting the same level of perplexity as in the speedrun finish line lol
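
For reference, bits-per-byte is easy to compute from the validation loss and, unlike per-token perplexity, is comparable across differently sized vocabs; a minimal sketch (names are illustrative):

    import math

    def bits_per_byte(total_nll_nats, total_utf8_bytes):
        # total_nll_nats: summed cross-entropy (natural log) over the validation text
        # total_utf8_bytes: size of that same validation text in UTF-8 bytes
        return total_nll_nats / (math.log(2) * total_utf8_bytes)
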
z3c0•8mo ago
To expand on the other comment, if you look under the data folder in nanoGPT, you can see examples of how to train the model using various data sources and encoders. "shakespeare_char" is probably the most rudimentary, only converting the characters of the input into integers.

e.g. https://github.com/karpathy/nanoGPT/blob/master/data/shakesp...
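
A sketch of what that character-level encoding amounts to (in the spirit of that prepare script, not a verbatim copy):

    # Every distinct character in the corpus gets an integer id.
    with open("input.txt", encoding="utf-8") as f:
        data = f.read()

    chars = sorted(set(data))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)

    ids = encode(data)  # list of ints, ready to be saved as a training array
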

yuvalpinter•8mo ago
We've been working on this problem for quite some time in my lab. We released a benchmark piling together several "intrinsic evaluations" that don't require model training. We're currently investigating correlations between performance on this benchmark and on downstream tasks. Here it is - https://github.com/MeLeLBGU/tokenizers_intrinsic_benchmark - where there's a link to the paper we introduced it in, used for checking how inference schemes work together with various token vocabs. It's for English, but cited work and citing work have some of these for other languages as well.
pbronez•8mo ago
Is there a more current post with similar information? I’d love to see how contemporary tokenizers have improved on these algorithms.
anonymoushn•8mo ago
They have not.
woodson•8mo ago
There has been some innovation, e.g. SuperBPE (https://arxiv.org/abs/2503.13423) claims substantial gains on 8B-sized models.
yuvalpinter•8mo ago
A few years ago we released SaGe, which is a contextual tokenizer, meaning that it builds a vocab that's fit for an LM use case because tokens are selected to appear within as clear a context set as possible in a corpus. https://github.com/MeLeLBGU/SaGe It does better than both BPE and UnigramLM in several benchmarks. https://aclanthology.org/2024.acl-short.73/
anonymoushn•8mo ago
Cool, I can have a look. By the way, I've filed a PR against the pathpiece repository correcting an error in the docs.

I have read this paper before, and now that I have looked more closely I am a bit disappointed that the paper does not specify precisely how the large initial vocabularies are constructed for the learners in the experiments that begin from a large initial vocabulary. I also notice that the same error I have proposed correcting in the docs is present in the paper in section 3.1.

Additional edit: Alright, I see that A.2.1 specifies that the "n-gram" initial vocab condition means the most frequent 2^18 n-grams for n in 1..L. This seems like not very many, and not weighting n-grams by "potential CTC reduction" (=length-1) for inclusion in this initial vocab seems to create a bias against long n-grams. I'm also sort of wary of these greedy learners, because many candidates for inclusion in the vocab are interdependent with other candidates (the marginal loss or gain of a candidate may be near 0 because of the presence of some other token which may not actually be optimal to include, or there may be groups of included tokens which cause other better groups to have near 0 marginal loss or gain), so I would really want to avoid being too strongly path-dependent in the learner. If I may hand wave a great deal, I think this sort of issue also arises if your objective function is not CTC.

cschmidt•8mo ago
Somehow I didn't get any notifications of your PR. Sorry about that. I'll take a look.
cschmidt•8mo ago
Co-author of the PathPiece paper here.

With regard to weighting the n-grams by length*frequency, I'm not sure it is clear that that would be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and hence, unigram produces longer tokens on average. It is generally considered that this is a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.

Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.

anonymoushn•8mo ago
I think that the ideal number of initial n-grams would be large enough that adding additional initial n-grams has no effect on the output, because I expect to not be very good at tuning two different knobs.
cschmidt•8mo ago
Regarding $O(n L^2)$ vs $O(n L)$, that was because we somewhat sloppily tend to use the term 'tokenization' for both training a tokenizer vocab, and for tokenizing a given document. In the paper, we tried to always call the latter one segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter $O(n L)$. I'll update the README to be more explicit about this.
anonymoushn•8mo ago
No, the segmentation algorithm you have implemented has runtime O(N*L^2). In particular, if you want to do a hash table lookup using a string key, that takes time proportional to the length of the string, not constant time.
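
To make the accounting concrete, here is a generic shortest-path segmentation sketch (not PathPiece's actual code). It does O(N * L) candidate lookups, and each lookup hashes a string of up to L characters, hence O(N * L^2) overall. It assumes every single character is in the vocab so a full segmentation always exists.

    def segment(text, vocab, max_token_len):
        # best[i] = fewest tokens covering text[:i]
        N = len(text)
        best = [0] + [None] * N
        back = [0] * (N + 1)
        for i in range(1, N + 1):
            for l in range(1, min(max_token_len, i) + 1):
                piece = text[i - l:i]  # building and hashing this string is O(L)...
                if best[i - l] is not None and piece in vocab:  # ...so the lookup is O(L), not O(1)
                    if best[i] is None or best[i - l] + 1 < best[i]:
                        best[i], back[i] = best[i - l] + 1, i - l
        # Backtrack to recover the token sequence.
        tokens, i = [], N
        while i > 0:
            tokens.append(text[back[i]:i])
            i = back[i]
        return tokens[::-1]
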
cschmidt•8mo ago
That's an interesting point. While you're correct, of course, it is so common to treat a hash table lookup as an O(1) operation that it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
janalsncm•8mo ago
Interesting recent research from Meta going in what I would call the opposite direction from an interpretability standpoint: https://arxiv.org/abs/2412.09871

Rather than splitting based on a predefined list you can check with a text editor, BLTs use a neural net to tokenize. So it will be pretty hard to debug.

YossarianFrPrez•8mo ago
This is a cool paper!

Basically, prior to feeding text to the tokenizer, people have split the text on whitespace. But whitespace isn't exactly a meaningful separator. By getting rid of this restriction as the tokenizer is 'learning', some of the tokens end up being 'by the way' or 'in the long run.' The researchers find that this makes the model much more efficient.

yuvalpinter•8mo ago
This can also be done within the tokenization framework, see our work here: https://arxiv.org/abs/2504.00178
woodson•8mo ago
How does this differ from SuperBPE, which seems to pursue a similar goal? https://arxiv.org/abs/2503.13423

Looks like parallel invention. (I’m not associated with the paper or its authors.)

anonymoushn•8mo ago
In SuperBPE, a fixed number of tokens are learned, and then the constraints of pretokenization are removed entirely, and then the remainder of the target vocab size is learned.

In Boundless BPE, no schedule must be chosen, because there is not any point at which the constraints of pretokenization are removed entirely. Instead, at any point in the learning process, merges between adjacent pretokens are permitted if the pretokens are each represented by a single token. There are some additional details about how the authors incorporate Picky BPE, which I will not try to repeat because I would probably get them wrong.
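
For illustration, a heavily simplified sketch of the SuperBPE-style two-stage schedule described above (toy whitespace pretokenization and greedy most-frequent-pair merges; not the actual SuperBPE or BoundlessBPE implementation):

    from collections import Counter

    def pair_counts(chunks):
        counts = Counter()
        for chunk in chunks:
            counts.update(zip(chunk, chunk[1:]))
        return counts

    def apply_merge(chunk, pair, new_sym):
        out, i = [], 0
        while i < len(chunk):
            if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(chunk[i])
                i += 1
        return out

    def train_two_stage_bpe(text, n_merges, stage1_merges):
        # Stage 1: ordinary BPE, merges confined to whitespace-delimited pretokens.
        chunks = [list(word) for word in text.split()]
        merges = []
        for step in range(n_merges):
            if step == stage1_merges:
                # Stage 2: drop the pretokenization constraint so later merges
                # may span word boundaries (via an explicit space symbol).
                flat = []
                for chunk in chunks:
                    flat.extend(chunk + [" "])
                chunks = [flat]
            counts = pair_counts(chunks)
            if not counts:
                break
            best = max(counts, key=counts.get)
            chunks = [apply_merge(c, best, best[0] + best[1]) for c in chunks]
            merges.append(best)
        return merges
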

cschmidt•8mo ago
Yes, they were concurrent work. (Co-author of BoundlessBPE here). A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
mrkramer•8mo ago
So LLMs hallucinate by design, which is of course bad for any serious work, but then again I was thinking about the situations in which hallucinations would actually be useful... let's say poetry writing, lyrics writing, etc., where the crazier the better, or generating business ideas, "crazy ideas succeed" type logic.
Lerc•8mo ago
While training something on TinyStories one of the early stage outputs was

"Once upon a time, in a small town, there was a huge town"

Which I thought was a brilliant start for an interesting story. Just that slight deviation from expectation provides an excellent scaffold for your imagination to take hold.

Of course the rest of the generation was utter rubbish, but if you could deviate from the standard on demand at key points, while keeping the majority internally consistent, then it would be great.

mrkramer•8mo ago
Poetry is full of figures of speech, which are words and phrases that intentionally differ from common logic and common knowledge, so LLMs can actually be probabilistic and accidental generators of figures of speech. Every piece of nonsense that LLMs generate can be interpreted as a potential metaphor or an oxymoron. The sentence that you mentioned sounds like a badass metaphor to me, "Once upon a time, in a small town, there was a huge town", or in other words: the town might be small in size but it has soul.
Lerc•8mo ago
I have been doing experiments along these lines. Initially I have avoided the issue with the semantics of words altogether because there is plenty of work to be done before you even get to the words themselves.

With BPE you get different encodings based upon punctuation and whitespace whenever it decides to pair the characters before and after a word into the word itself.

The tokenizer I have been using breaks words into components of:

   prefix character
   prefix character repeat count
   index of lowercase(word)
   index of capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps)
   suffix character
The encoder encodes the character before and after the word, double-encoding the information into the word's previous and next neighbors; the decoder decreases the prefix count by one if it matches the suffix of the previous word. Anything that doesn't match a known word or is weirdly capitalized gets character-encoded as a series of prefix-suffix characters with no word body. (A separate channel for chars would probably be better.)

Each channel gets its own embedding table, and the embeddings from all the channels are summed before being passed into the transformer.

Decode is just a separate last-stage layer translating the final embedding into channels. In hindsight it should have been a little more than that, because if the options for the next word were split between " YELLED!" and " whispered." my current system could theoretically produce " WHISPERED.". In practice it doesn't seem to do that much, but that means it's had to learn something to deal with it (I suspect by limiting variation). Adding a little smarts to the final decoding step would help: perhaps choose the word index first, then use the embedding of the matched word to filter the predicted embedding before calculating the other channels.
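
For illustration, a minimal PyTorch sketch of the summed-channel embedding described above (channel names, vocabulary sizes, and limits are my guesses, not the commenter's actual code):

    import torch
    import torch.nn as nn

    class ChannelEmbedding(nn.Module):
        def __init__(self, d_model, n_chars=256, n_words=8192, max_repeat=16, n_cap_modes=3):
            super().__init__()
            self.prefix_char = nn.Embedding(n_chars, d_model)
            self.prefix_repeat = nn.Embedding(max_repeat, d_model)
            self.word = nn.Embedding(n_words, d_model)          # index of lowercase(word)
            self.cap_mode = nn.Embedding(n_cap_modes, d_model)  # 0=all lower, 1=initial cap, 2=all caps
            self.suffix_char = nn.Embedding(n_chars, d_model)

        def forward(self, channels):
            # channels: dict of (batch, seq) integer tensors, one per channel;
            # the per-channel embeddings are summed into one input embedding.
            return (self.prefix_char(channels["prefix_char"])
                    + self.prefix_repeat(channels["prefix_repeat"])
                    + self.word(channels["word"])
                    + self.cap_mode(channels["cap_mode"])
                    + self.suffix_char(channels["suffix_char"]))
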

I have not yet done anything on breaking words themselves up. I have been using TinyStories for training, so there hasn't been a need for it with so few unique words. I have given it a lot of thought though, and I think I would contest the 'Gold standard' encodings. I think a word like 'nodular' should be encoded as something like 'nodule' '<of or relating to modifier> <minus e> ar'

It's a little hard to comprehend what this might look like for other languages, but I think there are probably insights to be had if you tried to make something that encoded English, Latin, and Kanji equally well.

I'd be curious to know the total number of homonyms across languages, too. Just a single signal to say 'this one is a known homonym' would probably be beneficial. If the total number is low enough, giving them their own token range might work too.