Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
256•theblazehen•2d ago•85 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
26•AlexeyBrin•1h ago•2 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
706•klaussilveira•15h ago•206 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
969•xnx•21h ago•558 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
69•jesperordrup•6h ago•31 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
7•onurkanbkrc•47m ago•0 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
135•matheusalmeida•2d ago•35 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
45•speckx•4d ago•36 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
68•videotopia•4d ago•7 comments

Welcome to the Room – A lesson in leadership by Satya Nadella

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
39•kaonwarb•3d ago•30 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
13•matt_d•3d ago•2 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
45•helloplanets•4d ago•46 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
240•isitcontent•16h ago•26 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
238•dmpetrov•16h ago•126 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
340•vecti•18h ago•149 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
506•todsacerdoti•23h ago•248 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
389•ostacke•22h ago•98 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
304•eljojo•18h ago•188 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•186 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
428•lstoll•22h ago•284 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
3•andmarios•4d ago•1 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
71•kmm•5d ago•10 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
23•bikenaga•3d ago•11 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
96•quibono•4d ago•22 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
26•1vuio0pswjnm7•2h ago•16 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
271•i5heu•18h ago•219 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
34•romes•4d ago•3 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1079•cdrnsf•1d ago•461 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
64•gfortaine•13h ago•30 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
306•surprisetalk•3d ago•44 comments

Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken

https://github.com/M4THYOU/TokenDagger
281•matthewolfe•7mo ago
TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent doing regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forgo regex matching of special tokens altogether.

Benchmarking code is included. Notable results show:

  - 4x faster code sample tokenization on a single thread.
  - 2-3x higher throughput when tested on a 1GB natural language text file.
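
Since the pitch is drop-in compatibility, here is a minimal sketch of what usage could look like. Only the tiktoken calls are real API; the tokendagger module and constructor names are assumptions for illustration, not confirmed entry points:

  # Hypothetical drop-in usage mirroring Tiktoken's API surface.
  # The tokendagger names below are assumed; check the TokenDagger README.
  import tiktoken
  import tokendagger  # assumed module name

  text = "TokenDagger keeps the exact same BPE vocab and special-token rules."

  enc = tiktoken.get_encoding("cl100k_base")        # baseline: OpenAI's Tiktoken
  reference_ids = enc.encode(text)

  dagger = tokendagger.get_encoding("cl100k_base")  # assumed drop-in constructor
  assert dagger.encode(text) == reference_ids       # same vocab, same ids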

Comments

chrismustcode•7mo ago
There’s something beautiful about creating a drop-in replacement for something that improves performance substantially.

ScyllaDB comes to mind

matthewolfe•7mo ago
Agreed. I figured nobody would use it otherwise.
parhamn•7mo ago
Put it in the readme & description. It's a big selling point.
matthewolfe•7mo ago
Thanks, I clarified it.
pvg•7mo ago
To be fair, many people have token stabbing needs.
npalli•7mo ago
Kudos. I think (in the short term at least) there is a large amount of perf optimization to be found by coding parts of the whole AI/ML infrastructure in C++ like this one, not as a rewrite (god no!) but as drop-ins that fix key bottlenecks. Anytime I see someone (Chinese engineers seem good at this) put something out in C++, there's a good chance some solid engineering tradeoffs have been made and a dramatic improvement will be seen.
matthewolfe•7mo ago
Agreed. A former mentor of mine told me a nice way of viewing software development:

1. Make it work. 2. Make it fast. 3. Make it pretty.

Transformers & LLMs have been developed to a point where they work quite well. I feel as though we're at a stage where most substantial progress is being made on the performance side.

diggan•7mo ago
Heh, seems the people I've been learning from have been biased away from beauty, as I know it as "Make It Work, Make It Right, Make It Fast".
abybaddi009•7mo ago
What's the difference between make it work and make it right? Aren't they the same thing?
stavros•7mo ago
Yeah, if it's not right, it doesn't work.
darknoon•7mo ago
In ML, often it does work to a degree even if it's not 100% correct. So getting it working at all is all about hacking b/c most ideas are bad and don't work. Then you'll find wins by incrementally correcting issues with the math / data / floating point precision / etc.
DSingularity•7mo ago
Not true. Things can work with hacks. Your standards might consider it unacceptable to have hacks. So you can have a “make it right” stage.
gabrielhidasy•7mo ago
Depends on your definition of "right" and "work". It could be a big ball of mud that always returns exactly the required response (so it 'works'), but be hellishly hard to change and very picky about dependencies and environment (so it's not 'right').
stavros•7mo ago
Nope, it's right, but it's not pretty.
robertfw•7mo ago
Making it work can be a hacky, tech debt laden implementation. Making it right involves refactoring/rewriting with an eye towards maintainability, testability, etc etc
gopalv•7mo ago
> make it work and make it right?

My mentor used to say it is the difference between a screw and glue.

You can glue some things together and prove that it works, but eventually you learn that anytime you had to break something to fix it, you should've used a screw.

It is a trade-off in coupling - the glue binds tightly over the entire surface, but a screw concentrates the loads, so it needs maintenance to stay tight.

You only really know which is "right" if you test it to destruction.

All of that advice is probably sounding dated now; even in materials science the glue might be winning (see the Tesla bumper or Lotus Elise bonding videos - every screw is extra grams).

kevindamm•7mo ago
I've usually heard/said it as

  1. Make it
  2. Make it work
  3. Make it work better
(Different circumstances have different nuances about what "better" means; it isn't always performance optimization. Some do substitute "faster" for "better" here, but I think it loses generality then.)
acosmism•7mo ago
i like this version best
gabrielhidasy•7mo ago
I always heard the "Make it Right" as "Make it Beautiful", where Right and Beautiful would mean "non-hacky, easily maintainable, easily extendable, well tested, and well documented"
matthewolfe•7mo ago
Fair chance I'm remembering it wrong :D
mindcrime•7mo ago
I've always heard it (and said it) as:

  1. Make it work
  2. Make it correct
  3. Make it fast
binarymax•7mo ago
The Huggingface transformers lib is currently undergoing a refactor to get rid of cruft and make it more extensible, hopefully with some perf gains.
jotux•7mo ago
A similar concept dates back to 30BC: https://en.wikipedia.org/wiki/De_architectura

Firmitas, utilitas, venustas - Strong, useful, and beautiful.

saretup•7mo ago
And while we’re at it, let’s move away from Python altogether. In the long run it doesn’t make sense just because it’s the language ML engineers are familiar with.
tbalsam•7mo ago
No! This is not good.

Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're having slowdowns from Pythonland, you're doing something terribly wrong.

Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.

bigyabai•7mo ago
It makes plenty of sense. Python handles strings well, has a great package ecosystem, and is easy to write/learn for non-programmers. It can be easily embedded into a notebook (which is huge for academics) and is, at least in theory, a "write once, run anywhere" platform. It's great.

If you think Python is a bad language for AI integrations, try writing one in a compiled language.

mdaniel•7mo ago
> has a great package ecosystem

So great there are 8 of them. 800% better than all the rest!

> If you think Python is a bad language for AI integrations, try writing one in a compiled language.

I'll take this challenge, all day, every day, so long as I and the hypothetical 'move fast and break things' crowd are held to the same "must run in prod" and "must be understandable by some other human" qualifiers.

What type is `array`? Don't worry your pretty head about it, feed it whatever type you want and let Sentry's TypeError sort it out <https://github.com/openai/whisper/blob/v20250625/whisper/aud...> Oh, sorry, and you wanted to know what `pad_or_trim` returns? Well that's just, like, your opinion man

bigyabai•7mo ago
Tracks with me, I don't like using Python for real programming. Try explaining any of your "Python sucks" catechisms to a second-year statistics student though. If you'd rather teach them C++, be my guest. If you want to make them indebted to proprietary infra like Mojo or CUDA, knock yourself out.

I'm still teaching them Python.

janalsncm•7mo ago
Most of that is already happening under the hood. A lot of performance-sensitive code is already written in C or Cython, for example numpy, scikit-learn, and pandas. Lots of torch code is either C or CUDA.

ML researchers aren't using Python because they are dumb. They use it because what takes 8 lines in Java can be done in 2 or 3 (including `import json`) in Python, for example.
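
As a concrete illustration of the brevity being claimed (an assumed example, not taken from the comment):

  # Parsing JSON in Python: a couple of lines, with json in the standard library.
  import json

  config = json.loads('{"model": "llama-3", "batch_size": 8}')
  print(config["model"])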

ipsum2•7mo ago
Sort of. The key bottlenecks are not in tokenization but in running the actual CUDA kernels. Python actually has very little overhead. (See vLLM, which is primarily written in Python.) So when people (like DeepSeek) 'rewrite in C++', they're usually just rewriting CUDA kernels to be more efficient.
notatallshaw•7mo ago
It looks like TikToken is written in Rust (https://github.com/openai/tiktoken/tree/main/src), are the gains here actually from porting to C++?
fhub•7mo ago
From the post

Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent doing regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forgo regex matching of special tokens altogether.

konsalexee•7mo ago
> simplifying the algorithm to forgo regex matching of special tokens altogether

Does that mean there could be cases with lower-quality tokenization?

matthewolfe•7mo ago
The output should be identical, assuming no bugs.

The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with `|` [0]. Then the actual encoding process checks for matches on this expression.

Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regexp of special tokens that shouldn't be considered at all.

TokenDagger does not do this. Instead, simple string matching is used. This works because we don't need to consider the entire special vocabulary every time. The caller of `encode` must explicitly define which special tokens should be considered [1]. So it's faster to check against the much smaller list we _know_ is being used.

[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476

[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...
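
A rough sketch of the contrast being described, with made-up special tokens (the real logic lives behind the two links above):

  import re

  special_tokens = ["<|begin_of_text|>", "<|end_of_text|>"] + [
      f"<|reserved_special_token_{i}|>" for i in range(1115)
  ]

  # Tiktoken-style: every special token goes into one big alternation that
  # the encoder scans for on every call.
  big_pattern = re.compile("|".join(re.escape(t) for t in special_tokens))

  def find_specials_regex(text):
      return [m.group(0) for m in big_pattern.finditer(text)]

  # TokenDagger-style, as described above: only check the special tokens the
  # caller explicitly allowed, using plain substring search.
  def find_specials_plain(text, allowed_special):
      found = []
      for tok in allowed_special:
          start = 0
          while (idx := text.find(tok, start)) != -1:
              found.append((idx, tok))
              start = idx + len(tok)
      return [tok for _, tok in sorted(found)]

  text = "<|begin_of_text|>hello world<|end_of_text|>"
  print(find_specials_regex(text))
  print(find_specials_plain(text, {"<|begin_of_text|>", "<|end_of_text|>"}))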

anonymoushn•6mo ago
Isn't this incorrect? If the user doesn't specify what to do with almost all of the special tokens, you still must detect them so you can raise an error.
manishsharan•7mo ago
Is there a tokenizer someone can recommend for code? I have tried CodeBERT, but maybe I am using it wrong, as my results with it were pretty bad.
fkyoureadthedoc•7mo ago
Would be cool to see WASM bindings for this here https://github.com/dqbd/tiktoken

Or maybe even your speedups from "b" in the pure js implementation

p0•7mo ago
How does this compare to the BPE crate [1]? Its main selling point is support for incrementally re-tokenising text, but it's also faster than tiktoken.

[1] https://crates.io/crates/bpe

matthewolfe•7mo ago
I'm working on incremental re-tokenizing next. Then I'll run some benchmarks against this crate too.
frabcus•7mo ago
Is there any way we can get local tokenizers for other LLMs? E.g. Gemini only offers a remote API for their tokenizer. Is it proprietary? Could we infer the token mapping somehow, efficiently, by making lots of calls?
weberer•7mo ago
I thought Gemini used SentencePiece

https://github.com/google/sentencepiece

Deathmax•7mo ago
Gemini uses SentencePiece [1], and the proprietary Gemini models share the same tokenizer vocabulary as Gemma [2, 3, 4].

Out of the large proprietary western AI labs (OpenAI, Anthropic, Google), only Anthropic with Claude 3 and newer lacks local tokenizers.

[1] https://github.com/google/sentencepiece

[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...

[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."

[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."

matthewolfe•7mo ago
A lot of model-specific tokenizers have reference implementations ([0], [1]). Underlying them is a core algorithm like SentencePiece or Byte-pair encoding (BPE). Tiktoken and TokenDagger are BPE implementations. The wrapping "tokenizer" mostly deals with the quirks of the vocabulary and handling special tokens.

For this project, I think there is value in building some of these model-specific quirks into the library. Could see some minor performance gains and generally make it easier to integrate with. It's probably not too much work to keep up with newer models. Tokenizers change much less frequently.

[0] https://github.com/meta-llama/llama-models/blob/01dc8ce46fec...

[1] https://github.com/mistralai/mistral-common/tree/main/src/mi...
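
A rough sketch of that layering, with purely illustrative names and a toy core standing in for real BPE:

  # A model-specific "tokenizer" wrapper that handles special tokens and
  # delegates ordinary text to a generic core. All names here are illustrative.
  import re

  class ModelTokenizer:
      def __init__(self, core_encode, special_tokens):
          self.core_encode = core_encode          # e.g. a BPE implementation
          self.special_ids = special_tokens       # model-specific quirks live here
          self.pattern = re.compile("|".join(map(re.escape, special_tokens)))

      def encode(self, text):
          ids, pos = [], 0
          for m in self.pattern.finditer(text):
              ids.extend(self.core_encode(text[pos:m.start()]))
              ids.append(self.special_ids[m.group(0)])
              pos = m.end()
          ids.extend(self.core_encode(text[pos:]))
          return ids

  # Toy core: one fake "token id" per whitespace-separated word.
  toy_core = lambda s: [hash(w) % 1000 for w in s.split()]
  tok = ModelTokenizer(toy_core, {"<|eot|>": 0})
  print(tok.encode("hello world<|eot|>"))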

pama•7mo ago
Cool. Would it be possible to eliminate that little vocab-format conversion requirement I see in the test against tiktoken? It would be nice to have a fully compatible drop-in replacement without having to think about details. It would also be nice to have examples that work the other way around: initialize tiktoken as you normally would, including any specialized extension of standard tokenizers, and then use that initialized tokenizer to initialize a new TokenDagger and test identity of results.
matthewolfe•7mo ago
Ah good catch. Updating this right now.
matthewolfe•7mo ago
Alright, 0.1.1 should now be a true drop-in replacement. I'll write up some examples soon.
janwilmake•7mo ago
You know what's also faster for getting a rough token count? string.length/5
_flux•7mo ago
It is not helpful in actual tokenization, though.
EGreg•7mo ago
What about pairing this with BigBird and Mamba?
pamelafox•7mo ago
Just curious whether it's possible to push any of your performance improvements to tiktoken itself?
matthewolfe•7mo ago
I probably will. I was hesitant initially, because adding PCRE2 as a dependency might cause issues for existing projects. I believe this was discussed briefly in a closed PR with other performance improvements.
kevmo314•7mo ago
Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie

The takeaway I also found was that the running cost was really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that is upstreamable?

matthewolfe•7mo ago
Cool!

I've reached out to the guy who maintains Tiktoken to talk about this.

22c•7mo ago
There is at least some awareness already when it comes to the performance of the regex engine:

https://github.com/openai/tiktoken/blob/main/src/lib.rs#L95-...

polynomial•7mo ago
Just to note that Tiktoken is still the tokenizer behind the GPT-4x series; it just uses a different token model. (The post only says GPT-3, implying they were using something else for subsequent iterations.)
silentsea90•7mo ago
"I’m teaching myself LLM internals by re-implementing the stack from first principles." - curious what resources you're using? Any books or courses, or just building it straight up? Great work!
matthewolfe•7mo ago
Modal's GPU glossary is a good overview of how GPUs work [0]. Karpathy's LLM overview is a good high-level introduction to LLMs [1]. 3b1b's video (and subsequent videos) on transformers was excellent at helping me understand the math at a high level [2]. This matrix multiplication optimization worklog helped me understand writing better CUDA (not a beginner intro, though) [3].

During this process I also asked ChatGPT a lot of questions.

I'm definitely open to suggestions about "how to learn" with all the new tools we have. I felt this has not been straightforward to figure out.

[0] https://modal.com/gpu-glossary

[1] https://www.youtube.com/watch?v=7xTGNNLPyMI

[2] https://www.youtube.com/watch?v=wjZofJX0v4M

[3] https://siboehm.com/articles/22/CUDA-MMM

matrix2596•7mo ago
Is it possible for your tokenizer to ever give a different tokenization than the OpenAI tokenizer? I am asking because there are multiple ways to tokenize the same string. Sorry if I am mistaken.
matthewolfe•7mo ago
Should be the same. Both use Byte-Pair Encoding (BPE) as the underlying algorithm.
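
For intuition on why the outputs should match, a toy BPE sketch: given identical merge ranks (made up here), any correct implementation greedily applies the same merges and lands on the same tokens:

  # Toy byte-pair encoding: repeatedly merge the adjacent pair with the
  # lowest (highest-priority) rank. Identical ranks imply identical output.
  merge_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

  def bpe(word):
      parts = list(word)
      while True:
          # Find the best-ranked adjacent pair still present.
          pairs = [(merge_ranks[p], i) for i, p in enumerate(zip(parts, parts[1:]))
                   if p in merge_ranks]
          if not pairs:
              return parts
          _, i = min(pairs)
          parts[i:i + 2] = [parts[i] + parts[i + 1]]

  print(bpe("lower"))  # ['low', 'er']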
Tiberium•7mo ago
Can you also compare the performance with https://github.com/huggingface/tokenizers/? Would be helpful, since the benchmark in the tiktoken readme seems to be very outdated.
binarymax•7mo ago
Anecdotally I've always found tiktoken to be far slower than huggingface tokenizers. I'm not sure why, as I haven't dug into tiktoken, but I'm a heavy user of HF's rust tokenizers
superlopuh•7mo ago
Can someone familiar with performance of LLMs please tell me how important this is to the overall perf? I'm interested in looking into optimizing tokenizers, and have not yet run the measurements. I would have assumed that the cost is generally dominated by matmuls but am encouraged by the reception of this post in the comments.
serjester•7mo ago
Tokenizing text is a ridiculously small part of the overall computation that goes into serving a request. With that said, if you're doing this on petabytes of data, it never hurts to have something faster.
odyssey7•7mo ago
A language that isn’t memory-safe can definitely hurt. AI needs more security, not less.
refibrillator•7mo ago
Tokenization is typically done on CPU and is rarely (if ever) a bottleneck for training or inference.

GPU kernels typically dominate in terms of wall-clock time; the only exception might be very small models.

Thus the latency of tokenization can essentially be “hidden” by having the CPU prepare the next batch while the GPU finishes the current one.
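
A minimal sketch of that overlap, with stand-in functions for tokenization and the GPU step (names are illustrative, not from any particular framework):

  # Hide tokenization latency: a background thread tokenizes batch N+1
  # while the "GPU" works on batch N; a queue provides the handoff.
  import queue, threading

  def tokenize(batch):          # stand-in for CPU-side pretokenization
      return [len(s.split()) for s in batch]

  def run_on_gpu(token_batch):  # stand-in for the forward pass
      return sum(token_batch)

  batches = [["hello world"], ["the quick brown fox"], ["lorem ipsum dolor"]]
  q = queue.Queue(maxsize=2)

  def producer():
      for b in batches:
          q.put(tokenize(b))    # CPU prepares the next batch ahead of time
      q.put(None)               # sentinel: no more batches

  threading.Thread(target=producer, daemon=True).start()

  while (tokens := q.get()) is not None:
      print(run_on_gpu(tokens))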

benreesman•7mo ago
Tokenization performance is complicated, but your guidepost is that the institutions with the resources and talent to do so choose to write extremely fast tokenizers: sentencepiece and tiktoken both pay dearly in complexity (particularly complexity of deployment, because now you've got another axis of architecture-specific build/bundle/dylib to manage in addition to whatever your accelerator burden always was: it's now aarch64 cross x86_64 cross CUDA capability...)

Sometimes it can overlap with accelerator work, but pros look at flame graphs: a CPU core running the AVX lanes hard isn't keeping the bus fed; a million things. People pre-tokenize big runs all the time.

I don't know why this thread is full of "nothing to see here"; this obliterates the SOTA from the money-is-no-object status quo. I'd like to think better of the community than the obvious, which is that C++ is threatening a modest mindshare comeback against a Rust narrative that's already under pressure from the explosion of interest in Zig. Maybe there's a better reason.

matthewolfe•7mo ago
To echo the other replies, the tokenizer is definitely not the bottleneck. It just happens to be the first step in inference, so it's what I did first.
isjustintime•7mo ago
Very cool. We use Tiktoken and I'd love to see the performance impact. Pretty great decision to make it drop-in compatible.
sheerun•7mo ago
Now that byte-patch-level embeddings are discovered?
semiinfinitely•7mo ago
I'm relieved to see that it's not written in Rust
matthewolfe•7mo ago
haha, I thought about it.
singularity2001•7mo ago
this is still the outdated architecture without special tokens for numbers, i.e. out-of-vocab tokens like NUM_FLOAT(3.1415), right?
justinhj•7mo ago
I've been playing with tokenization too. Starting from Karpathy's Python minbpe, I set myself the task of training a tokenizer on wikitext (500 MB) in a reasonable time. I got the C++ version down to about 50 minutes, compared to an estimated several months for the original Python code.

Haven't really spent much time looking at encode and decode but I plan to incorporate these regex modifications when I do!

https://github.com/justinhj/minbpe-cc