frontpage.

Precision Clock Mk IV

https://mitxela.com/projects/precision_clock_mk_iv
230•ahlCVA•4h ago•73 comments

A Lean companion to Analysis I

https://terrytao.wordpress.com/2025/05/31/a-lean-companion-to-analysis-i/
93•jeremyscanvic•2h ago•5 comments

Oxfordshire clock still keeping village on time after 500 years

https://www.bbc.com/news/articles/cz70p0qevlro
60•1659447091•2d ago•27 comments

Show HN: PunchCard Key Backup

https://github.com/volution/punchcard-key-backup
60•ciprian_craciun•3h ago•23 comments

We're beating $359M in funding with two people and OCaml

https://terrateam.io/blog/punching-above-weight
7•imadj•37m ago•0 comments

Photos taken inside musical instruments

https://www.dpreview.com/photography/5400934096/probe-lenses-and-focus-stacking-the-secrets-to-incredible-photos-taken-inside-instruments
893•worik•23h ago•46 comments

AtomVM, the Erlang virtual machine for IoT devices

https://www.atomvm.net/
129•ahamez•3d ago•39 comments

The Two Ideals of Fields

https://susam.net/two-ideals-of-fields.html
34•susam•5h ago•17 comments

Using Ed(1) as My Static Site Generator

https://aartaka.me/this-post-is-ed.html
37•BoingBoomTschak•5h ago•14 comments

Show HN: Fontofweb – Discover Fonts Used on a Website or Websites Using Font(s)

https://fontofweb.com
34•sim04ful•5h ago•18 comments

AI video you can watch and interact with, in real-time

https://experience.odyssey.world
79•olivercameron•3d ago•25 comments

Beware of Fast-Math

https://simonbyrne.github.io/notes/fastmath/
254•blobcode•12h ago•169 comments

Designing Pareto-optimal RAG workflows with syftr

https://www.datarobot.com/blog/pareto-optimized-ai-workflows-syftr/
26•roma_glushko•3d ago•7 comments

Using lots of little tools to aggressively reject the bots

https://lambdacreate.com/posts/68
117•archargelod•11h ago•57 comments

Gradients Are the New Intervals

https://www.mattkeeter.com/blog/2025-05-14-gradients/
112•surprisetalk•13h ago•41 comments

Acclimation of Osmoregulatory Function in Salmon

https://www.unm.edu/~toolson/salmon_osmoregulation.html
16•mooreds•5h ago•3 comments

Webb telescope refines Hubble constant, suggesting resolution to expansion rate debate

https://phys.org/news/2025-05-webb-telescope-refines-hubble-constant.html
76•pseudolus•3d ago•40 comments

Atlas: Learning to Optimally Memorize the Context at Test Time

https://arxiv.org/abs/2505.23735
13•og_kalu•5h ago•0 comments

Surprisingly fast AI-generated kernels we didn't mean to publish yet

https://crfm.stanford.edu/2025/05/28/fast-kernels.html
354•mfiguiere•23h ago•149 comments

Show HN: AI Peer Reviewer – Multiagent System for Scientific Manuscript Analysis

https://github.com/robertjakob/rigorous
74•rjakob•5h ago•65 comments

Show HN: I built an AI agent that turns ROS 2's turtlesim into a digital artist

https://github.com/Yutarop/turtlesim_agent
23•ponta17•9h ago•6 comments

Exploring a Language Runtime with Bpftrace

https://www.mgaudet.ca/technical/2025/5/28/exploring-a-language-runtime-with-bpftrace
5•mgaudet•3d ago•0 comments

The Illusion of Causality in Charts

https://filwd.substack.com/p/the-illusion-of-causality-in-charts
34•skadamat•3d ago•18 comments

The ‘white-collar bloodbath’ is all part of the AI hype machine

https://www.cnn.com/2025/05/30/business/anthropic-amodei-ai-jobs-nightcap
554•lwo32k•1d ago•998 comments

The Trackers and SDKs in ChatGPT, Claude, Grok and Perplexity

https://jamesoclaire.com/2025/05/31/the-trackers-and-sdks-in-chatgpt-claude-grok-and-perplexity/
57•ddxv•11h ago•2 comments

C++ to Rust Phrasebook

https://cel.cs.brown.edu/crp/
170•wcrichton•21h ago•57 comments

Beating Google's kernelCTF PoW using AVX512

https://anemato.de/blog/kctf-vdf
316•anematode•1d ago•91 comments

Microsandbox: Virtual Machines that feel and perform like containers

https://github.com/microsandbox/microsandbox
352•makeboss•1d ago•169 comments

Pure vs. Impure Iterators in Go

https://jub0bs.com/posts/2025-05-29-pure-vs-impure-iterators-in-go/
37•ingve•2d ago•13 comments

AccessOwl (YC S22) is hiring an AI TypeScript Engineer to connect 100s of SaaS

https://www.ycombinator.com/companies/accessowl/jobs/hfWAhVp-ai-enabled-senior-software-engineer-typescript-focus
1•mathiasn•12h ago

Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)

https://ndingwall.github.io/blog/tokenization
75•phewlink•1d ago

Comments

anonymoushn•1d ago
How does SentencePiece choose the initial vocabulary, which is then trimmed down to the final vocabulary that has these desirable properties?
mcyc•1d ago
Just a minor nit: SentencePiece is a library, not a tokenization algorithm. It implements two tokenization algorithms, Unigram and BPE.

BPE builds vocabularies from the base up so I assume you are talking about Unigram which starts with a big vocabulary and trims it.

The details of UnigramLM are here https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is Section 3.2.

Basically, it just selects all substrings that appear in the corpus up to a certain length (and then maybe trims it a little by discarding rare substrings or something to reduce the initial size a bit and make things faster).
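
A rough sketch of that seeding step, using naive counting rather than the enhanced-suffix-array trick the paper mentions (corpus, max_len, and min_count are illustrative names):

    from collections import Counter

    def seed_vocab(corpus, max_len=8, min_count=5):
        """Enumerate every substring up to max_len chars and keep the frequent ones."""
        counts = Counter()
        for text in corpus:
            for i in range(len(text)):
                for j in range(i + 1, min(i + max_len, len(text)) + 1):
                    counts[text[i:j]] += 1
        # Discard rare substrings to shrink the initial vocabulary and speed things up.
        return {s: c for s, c in counts.items() if c >= min_count}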

anonymoushn•1d ago
If the library has two vocabulary learners, only one of which does the described thing, then isn't it unambiguous which implementation within the library the question refers to? And wouldn't it be ambiguous to instead say "how does Unigram do it" without referring to any particular implementation?

Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array algorithm isn't a top-k algorithm.

cschmidt•6h ago
It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods.

See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...

I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.
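
In other words, roughly this (a sketch assuming `counts` maps candidate substrings to corpus frequencies, as in the snippet earlier in the thread; not the actual SentencePiece internals):

    import heapq

    def top_seed_candidates(counts, n=1_000_000):
        """Score each candidate substring by frequency * length, keep the top n."""
        return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1] * len(kv[0]))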

mcyc•1h ago
You can cross whitespace boundaries by setting the flag `--split_by_whitespace` to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...

wejick•1d ago
The vibe of AI discussions back then was so different from today's (or at least from the past 2.5 years). It's quite surreal how fast things have moved.
macawfish•1d ago
It's always seemed like such low hanging fruit, so much semantic information just squandered in the thirst for larger models.
kevmo314•1d ago
Unfortunately the choice of tokenizer is baked into the model. If you want to innovate on the tokenizer, you have to train a whole base model yourself to prove it's better which makes the barrier for innovation pretty high.
mdaniel•1d ago
Apologies if this is a dumb question, but is there no "hello world"-ish sandbox for testing this theory? I can very easily imagine that trying to go head-to-head with R1 or such is going to be a boatload of GPU, but for just testing tokenizers head-to-head, isn't there a smaller-sized model that can be used in a bake-off?
anonymoushn•1d ago
You can use modded-nanogpt for testing this sort of change. If you don't want to do anything really weird, you can just train whatever tokenizer you like on the training set, then retokenize the input, then run the existing training code. One person did this earlier this year using a Tokenmonster vocab and got better downstream performance in less train time.
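
As a concrete sketch of that workflow (assuming a nanoGPT-style pipeline that reads uint16 token ids from train.bin; the file names and vocab size here are illustrative):

    import numpy as np
    import sentencepiece as spm

    # 1. Train whatever tokenizer you like on the training text.
    spm.SentencePieceTrainer.train(
        input="train.txt", model_prefix="custom", vocab_size=32768, model_type="unigram"
    )

    # 2. Retokenize the dataset with the new vocab.
    sp = spm.SentencePieceProcessor(model_file="custom.model")
    ids = sp.encode(open("train.txt").read())

    # 3. Write the ids in the binary format the existing training code expects,
    #    then run that training code unchanged.
    np.array(ids, dtype=np.uint16).tofile("train.bin")
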
pama•1d ago
What exact attempt with which Tokenmonster vocab do you refer to? Sometimes it is hard to conclude much from these efforts. For example, having a smaller vocabulary is typically only useful for small models where the compute cost of the softmax layer at the end of the decoder may still factor into the equation for the performance. Fixing the size of the vocabulary while increasing the rest of the model makes this inefficiency disappear.
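
A back-of-the-envelope illustration of that point: the output projection is d_model x vocab_size, so its share of the parameters (and roughly of per-token compute) shrinks as the model grows. The sizes below are placeholders:

    def output_layer_share(d_model, vocab_size, total_params):
        """Fraction of parameters in the final vocab projection."""
        return d_model * vocab_size / total_params

    print(output_layer_share(768, 50_000, 124e6))  # small GPT-2-scale model: ~0.31
    print(output_layer_share(8192, 50_000, 70e9))  # 70B-scale model: ~0.006
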
anonymoushn•1d ago
https://x.com/alexjc/status/1881410039639863622

I agree that it would be more useful to compare vocabs of identical size.

pama•1d ago
Thanks! It seems to me that the performance gain here was due to a smaller vocab size. This type of change is almost guaranteed to backfire for larger models and larger datasets / lower loss requirements, so it probably is not very useful. Generally the historical trend has been to use larger vocabularies as the models got larger themselves.
anonymoushn•22h ago
Well, he says he achieved some downstream win on the same size, but it didn't translate into a win in perplexity, so he tried to do something else. Like I said, it's unfortunate.
anonymoushn•11h ago
I actually wonder if he could just claim a win by calculating validation set BPB for both equally-sized vocabs instead of targeting the same level of perplexity as in the speedrun finish line lol
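
For reference, bits-per-byte is comparable across different vocabs because it normalizes by raw byte count rather than token count. A minimal sketch, where nll_nats is the summed negative log-likelihood over all validation tokens:

    import math

    def bits_per_byte(nll_nats, num_bytes):
        """Total validation NLL converted from nats to bits, per byte of raw text."""
        return nll_nats / (math.log(2) * num_bytes)
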
z3c0•23h ago
To expand on the other comment, if you look under the data folder in nanoGPT, you can see examples of how to train the model using various data sources and encoders. "shakespeare_char" is probably the most rudimentary, only converting the characters of the input into integers.

e.g. https://github.com/karpathy/nanoGPT/blob/master/data/shakesp...
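
Roughly what that char-level preparation boils down to (a simplified sketch, not the actual prepare.py):

    # Every distinct character in the corpus gets an integer id.
    text = open("input.txt").read()
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)

    data = encode(text)  # split into train/val and fed to the model as-is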

yuvalpinter•12h ago
We've been working on this problem for quite some time in my lab. We released a benchmark piling together several "intrinsic evaluations" that don't require model training. We're currently investigating correlations between performance on this benchmark and on downstream tasks. Here it is - https://github.com/MeLeLBGU/tokenizers_intrinsic_benchmark - where there's a link to the paper we introduced it in, used for checking how inference schemes work together with various token vocabs. It's for English, but cited work and citing work have some of these for other languages as well.
pbronez•1d ago
Is there a more current post with similar information? I’d love to see how contemporary tokenizers have improved on these algorithms.
anonymoushn•1d ago
They have not.
woodson•23h ago
There has been some innovation, e.g. SuperBPE (https://arxiv.org/abs/2503.13423) claims substantial gains on 8B-sized models.
yuvalpinter•12h ago
A few years ago we released SaGe, which is a contextual tokenizer, meaning that it builds a vocab that's fit for an LM use case because tokens are selected to appear within as clear a context set as possible in a corpus. https://github.com/MeLeLBGU/SaGe It does better than both BPE and UnigramLM in several benchmarks. https://aclanthology.org/2024.acl-short.73/
anonymoushn•12h ago
Cool, I can have a look. By the way, I've filed a PR against the pathpiece repository correcting an error in the docs.

I have read this paper before, and now that I have looked more closely I am a bit disappointed that the precise method of constructing the large initial vocabularies, for the vocabulary learners in the experiment that begin with one, is not specified. I also notice that the same error I have proposed correcting in the docs is present in the paper in section 3.1.

Additional edit: Alright, I see that A.2.1 specifies that the "n-gram" initial vocab condition means the most frequent 2^18 n-grams for n in 1..L. This seems like not very many, and not weighting n-grams by "potential CTC reduction" (=length-1) for inclusion in this initial vocab seems to create a bias against long n-grams. I'm also sort of wary of these greedy learners, because many candidates for inclusion in the vocab are interdependent with other candidates (the marginal loss or gain of a candidate may be near 0 because of the presence of some other token which may not actually be optimal to include, or there may be groups of included tokens which cause other better groups to have near 0 marginal loss or gain), so I would really want to avoid being too strongly path-dependent in the learner. If I may hand wave a great deal, I think this sort of issue also arises if your objective function is not CTC.

cschmidt•6h ago
Somehow I didn't get any notifications of your PR. Sorry about that. I'll take a look.
cschmidt•6h ago
Co-author of the PathPiece paper here.

With regard to weighting the n-grams by length*frequency, I'm not sure it is clear that that would be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and hence, unigram produces longer tokens on average. It is generally considered that this is a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.

Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.

anonymoushn•5h ago
I think that the ideal number of initial n-grams would be large enough that adding additional initial n-grams has no effect on the output, because I expect to not be very good at tuning two different knobs.
cschmidt•6h ago
Regarding $O(n L^2)$ vs $O(n L)$, that was because we somewhat sloppily tend to use the term 'tokenization' for both training a tokenizer vocab, and for tokenizing a given document. In the paper, we tried to always call the latter one segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter $O(n L)$. I'll update the README to be more explicit about this.
anonymoushn•5h ago
No, the segmentation algorithm you have implemented has runtime O(N*L^2). In particular, if you want to do a hash table lookup using a string key, that takes time proportional to the length of the string, not constant time.
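
Concretely: for each of the N end positions you try up to L candidate lengths, and slicing or hashing each candidate string costs up to O(L), so the whole pass is O(N*L^2). A minimal sketch of a min-token-count segmentation in that style (illustrative, not the actual PathPiece code; it assumes every single character is in the vocab so a segmentation always exists):

    def segment(text, vocab, max_len):
        """best[i] = fewest tokens covering text[:i]; back[i] = length of the last token."""
        INF = float("inf")
        n = len(text)
        best = [0] + [INF] * n
        back = [0] * (n + 1)
        for i in range(1, n + 1):
            for l in range(1, min(max_len, i) + 1):
                piece = text[i - l:i]          # O(L) slice...
                if piece in vocab and best[i - l] + 1 < best[i]:  # ...and O(L) hash/compare
                    best[i], back[i] = best[i - l] + 1, l
        tokens, i = [], n
        while i > 0:                           # walk the backpointers to recover tokens
            tokens.append(text[i - back[i]:i])
            i -= back[i]
        return tokens[::-1]
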
cschmidt•14m ago
That's an interesting point. While you're correct, of course, it is so common to consider a hash table lookup an O(1) operation that it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
janalsncm•1d ago
Interesting recent research from Meta going in what I would call the opposite direction from an interpretability standpoint: https://arxiv.org/abs/2412.09871

Rather than splitting based on a predefined list you can check with a text editor, BLTs use a neural net to tokenize. So it will be pretty hard to debug.

YossarianFrPrez•22h ago
This is a cool paper!

Basically, prior to feeding text to the tokenizer, people have split the text on whitespace. But whitespace isn't exactly a meaningful separator. By getting rid of this restriction as the tokenizer is 'learning', some of the tokens end up being 'by the way' or 'in the long run.' The researchers find that this makes the model much more efficient.

yuvalpinter•12h ago
This can also be done within the tokenization framework, see our work here: https://arxiv.org/abs/2504.00178
woodson•3h ago
How does this differ from SuperBPE, which seems to pursue a similar goal? https://arxiv.org/abs/2503.13423

Looks like parallel invention. (I’m not associated with the paper or its authors.)

anonymoushn•1h ago
In SuperBPE, a fixed number of tokens are learned, and then the constraints of pretokenization are removed entirely, and then the remainder of the target vocab size is learned.

In Boundless BPE, no schedule must be chosen, because there is not any point at which the constraints of pretokenization are removed entirely. Instead, at any point in the learning process, merges between adjacent pretokens are permitted if the pretokens are each represented by a single token. There are some additional details about how the authors incorporate Picky BPE, which I will not try to repeat because I would probably get them wrong.

cschmidt•19m ago
Yes, they were concurrent work. (Co-author of BoundlessBPE here.) A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
mrkramer•1d ago
So LLMs hallucinate by design, which is of course bad for any serious work, but then again I was thinking about situations where hallucinations would actually be useful... say poetry writing, lyrics writing, etc., where the crazier the better, or generating business ideas, "crazy ideas succeed" type logic.
Lerc•22h ago
While training something on TinyStories one of the early stage outputs was

"Once upon a time, in a small town, there was a huge town"

Which I thought was a brilliant start for an interesting story. Just that slight deviation from expectation provides an excellent scaffold for your imagination to take hold.

Of course the rest of the generation was utter rubbish, but if you could on-demand deviate from standard at key points while keeping the majority internally consistent then it would be great.

mrkramer•21h ago
Poetry is full of figures of speech, which are words and phrases that intentionally differ from common logic and common knowledge, so LLMs can actually be probabilistic and accidental generators of figures of speech. Every nonsense that LLMs generate can be interpreted as a potential metaphor or an oxymoron. The sentence that you mentioned sounds like a badass metaphor to me: "Once upon a time, in a small town, there was a huge town", or in other words: the town might be small in size, but it has soul.
Lerc•23h ago
I have been doing experiments along these lines. Initially I have avoided the issue with the semantics of words altogether because there is plenty of work to be done before you even get to the words themselves.

With BPE you get different encodings based upon punctuation and whitespace whenever it decides to pair the characters before and after a word into the word itself.

The tokenizer I have been using breaks words into components of:

   prefix character
   prefix character repeat count
   index of lowercase(word)
   index of capitalization mode  (0 = all lower, 1 = initial cap, 2 = all caps)
   suffix character
The encoder encodes the character before and after the word, double-encoding the information into the previous and next words; the decoder decreases the prefix count by one if it matches the suffix of the previous word. Anything that doesn't match a known word or is weirdly capitalized gets character-encoded as a series of prefix-suffix characters with no word body. (A separate channel for chars would probably be better.)
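
If I'm reading this right, a single word becomes something like the following (my rough sketch of the scheme, with made-up names; the repeat-count and fallback handling are simplified):

    def word_channels(prefix_char, prefix_repeat, word, suffix_char, vocab):
        """One token per word, described by the five channels listed above."""
        lower = word.lower()
        if word == lower:
            caps = 0                      # all lower
        elif word == word.capitalize():
            caps = 1                      # initial cap
        elif word == word.upper():
            caps = 2                      # all caps
        else:
            caps = None                   # weird capitalization
        if caps is None or lower not in vocab:
            return None                   # falls back to prefix/suffix character encoding
        return (prefix_char, prefix_repeat, vocab[lower], caps, suffix_char)

    # e.g. the word 'Hello' in ' Hello,' with vocab {'hello': 42}:
    # word_channels(' ', 1, 'Hello', ',', {'hello': 42}) -> (' ', 1, 42, 1, ',')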

Each channel gets its own embedding table, and the embeddings from all the channels are summed before being passed into the transformer.
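
i.e. something along these lines (a PyTorch-flavored sketch; the sizes are placeholders):

    import torch.nn as nn

    class ChannelEmbedding(nn.Module):
        def __init__(self, channel_sizes, d_model):
            super().__init__()
            # one table per channel: prefix char, repeat count, word index, caps mode, suffix char
            self.tables = nn.ModuleList(nn.Embedding(n, d_model) for n in channel_sizes)

        def forward(self, channels):
            # channels: (batch, seq, num_channels) integer ids; sum the per-channel embeddings
            return sum(table(channels[..., i]) for i, table in enumerate(self.tables))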

Decode is just a separate last-stage layer translating the final embedding back into channels. In hindsight it should have been a little more than that, because if the options for the next word were split between " YELLED!" and " whispered." my current system could theoretically produce " WHISPERED.". In practice it doesn't seem to do that much, but that means it's had to learn something to deal with it (I suspect by limiting variation). Adding a little smarts to the final tokenization step would help: perhaps choose the word index first, then use the embedding for the match to filter the predicted embedding before calculating the other channels.

I have not yet done anything on breaking words themselves up. I have been using TinyStories for training, so there hasn't been a need for it with so few unique words. I have given it a lot of thought though, and I think I would contest the 'Gold standard' encodings. I think a word like 'nodular' should be encoded as something like 'nodule' '<of or relating to modifier> <minus e> ar'

It's a little hard to comprehend what this might look like for other languages, but I think there are probably insights to be had if you tried to make something that encoded English, Latin, and Kanji equally well.

I'd be curious to know the total number of homonyms across languages, too. Just a single signal to say 'this one is a known homonym' would probably be beneficial. If the total number is low enough, giving them their own token range might work too.