
Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
201•theblazehen•2d ago•61 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
684•klaussilveira•15h ago•204 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
957•xnx•20h ago•553 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
65•videotopia•4d ago•3 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
126•matheusalmeida•2d ago•35 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
28•kaonwarb•3d ago•23 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
44•jesperordrup•5h ago•22 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
236•isitcontent•15h ago•26 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
230•dmpetrov•15h ago•122 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
25•speckx•3d ago•14 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
332•vecti•17h ago•145 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
499•todsacerdoti•23h ago•244 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
384•ostacke•21h ago•96 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
7•matt_d•3d ago•2 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
360•aktau•21h ago•183 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
294•eljojo•18h ago•184 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
420•lstoll•21h ago•280 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
66•kmm•5d ago•10 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
94•quibono•4d ago•22 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
21•bikenaga•3d ago•11 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
262•i5heu•18h ago•208 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
33•romes•4d ago•3 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
38•gmays•10h ago•13 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
61•gfortaine•12h ago•26 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1073•cdrnsf•1d ago•460 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
13•1vuio0pswjnm7•1h ago•0 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
294•surprisetalk•3d ago•44 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
152•vmatsiiako•20h ago•72 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
157•SerCe•11h ago•143 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
187•limoce•3d ago•102 comments

From text to token: How tokenization pipelines work

https://www.paradedb.com/blog/when-tokenization-becomes-token
134•philippemnoel•1mo ago

Comments

wongarsu•1mo ago
Notably, this is tokenization for traditional search. LLMs use very different tokenization with very different goals.
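For a concrete sense of the gap (a rough sketch in Python, not from the article; the stop word list is made up and tiktoken is assumed installed for the LLM side):

    import re

    STOPWORDS = {"the", "a", "of", "to"}  # tiny illustrative list

    def search_tokenize(text: str) -> list[str]:
        # search-style pipeline: lowercase, split on letters, drop stop words
        words = re.findall(r"[a-z]+", text.lower())
        return [w for w in words if w not in STOPWORDS]

    print(search_tokenize("The quick brown fox"))  # ['quick', 'brown', 'fox']

    # LLM-style tokenization keeps everything and must round-trip exactly
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("The quick brown fox")
    print(ids)                  # subword ids; nothing is discarded
    assert enc.decode(ids) == "The quick brown fox"

One pipeline is lossy on purpose (it only has to match documents); the other has to preserve the text bit for bit.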
tgv•1mo ago
It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with a richer morphology, or compounding. It's a very "English" approach.
jamesgresql•1mo ago
Chinese, Japanese, Korean etc.. don’t work like this either.

However, even though the approach is “old fashioned” it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate?

At the end of the day people choose a tokenizer that matches their language.

I will update the article to make all this clearer though!
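For the CJK point, here's a tiny illustration of why whitespace splitting gets you nowhere in Chinese and a dictionary-based segmenter has to find word boundaries itself (sketch; assumes the jieba package is installed):

    import jieba

    text = "我来到北京清华大学"   # "I came to Tsinghua University in Beijing"
    print(text.split())          # ['我来到北京清华大学'] -- no spaces, one "word"
    print(jieba.lcut(text))      # e.g. ['我', '来到', '北京', '清华大学']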

empiko•1mo ago
This was common even in 2015. You can still see people removing stop words from text, even when they feed it to LLMs. It's of course terrible for performance, but old habits die hard I guess.
jamesgresql•1mo ago
100%, maybe we should do a follow up on other types of tokenization.
semicognitive•1mo ago
ParadeDB is a great team, highly recommend using it.
the_arun•1mo ago
Just curious - if we remove stop words from prompts before sending them to the LLM, wouldn't that reduce the token count? And would the LLM's response stay the same (original vs. without stop words)?
kylecazar•1mo ago
Search engines can afford to throw out stopwords because they're often keyword based. But (frontier) LLMs need the nuance and semantics they signal -- they don't automatically strip them. There are probably special-purpose models that do this, or it happens in certain parts of a RAG pipeline, but that's the exception.

Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though. You're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is going to save you negligible $ and potentially cost a lot in model performance.
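If anyone wants to see what the savings actually amount to, a quick check along these lines works (sketch; the stop word list is just illustrative and tiktoken is assumed installed):

    import tiktoken

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "What is the capital of the country to the north of France?"
    stripped = " ".join(w for w in prompt.split() if w.lower() not in STOPWORDS)

    print(len(enc.encode(prompt)), "tokens before")
    print(len(enc.encode(stripped)), "tokens after")
    print(stripped)  # "What capital country north France?" -- shorter but more ambiguous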

cubefox•1mo ago
Don't know, but GPT-5 Thinking strips out a lot of words in its reasoning trace in order to save tokens. Someone on Twitter jailbroke it in order to get the original CoT traces.
gortok•1mo ago
My biggest complaints about search come from day-to-day uses:

I use search in my email pretty heavily, and I'm most interested in specific words in the email, and in when those emails are from specific people or a specific domain. But the mobile version of Gmail produces different results than the mobile Outlook app, which in turn differs from the desktop version of Gmail, and all of them are pretty terrible at search as it pertains to email.

I have a hard time getting them to pull up emails in search that I know exist, that I know have certain words, and that I know have certain email addresses in the body.

I recognize a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?

mattnewton•1mo ago
Huh, maybe your use case is around indexing the contents of attachments? I basically never search the contents of attachments, just the bodies of emails, and have found Gmail search to be really good. I switched back to the web client from the Mac's native Mail app for this reason, because search has been so good for me in Gmail.

I haven’t looked, but I wonder if there is a good hackable email client that will let you substitute out the search index with a reasonable abstraction from all the complicated email protocol stuff. I feel like building an index for your use case is totally achievable if so.
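Something like this toy sketch is the shape I mean (purely hypothetical, standard library only): an inverted index over message bodies plus a sender-domain filter, which covers most of what the parent comment is asking email search to get right.

    import re
    from collections import defaultdict

    class MailIndex:
        def __init__(self):
            self.by_word = defaultdict(set)    # word -> message ids
            self.by_domain = defaultdict(set)  # sender domain -> message ids
            self.messages = {}

        def add(self, msg_id, sender, body):
            self.messages[msg_id] = (sender, body)
            self.by_domain[sender.split("@")[-1]].add(msg_id)
            for word in re.findall(r"\w+", body.lower()):
                self.by_word[word].add(msg_id)

        def search(self, words, domain=None):
            hits = set(self.messages)
            for w in words:
                hits &= self.by_word[w.lower()]
            if domain:
                hits &= self.by_domain[domain]
            return sorted(hits)

    idx = MailIndex()
    idx.add(1, "alice@example.com", "Quarterly invoice attached")
    idx.add(2, "bob@corp.test", "Invoice for the new contract")
    print(idx.search(["invoice"], domain="example.com"))  # [1]

The hard part is obviously everything around it (IMAP sync, attachments, ranking), not the index itself.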

heikkilevanto•1mo ago
Good explanation of tokenizing English text for regular search. But it is far from universal, and will not work well in Finnish, for example.

Folding diacritics makes "vähä" (little) into "vaha" (wax).

Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).

Stemming Finnish words is also much more complex, as we tend to append suffixes to words instead of putting small words in front of them. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" makes it a question: "from my house?"

If that sounds too easy, consider Japanese. From what little I know they don't use whitespace to separate words, mix two phonetic alphabets with Chinese ideograms, etc.
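To make the diacritics point concrete, here is a small standard-library sketch of blanket folding and the vähä/vaha collision it creates:

    import unicodedata

    def fold(text: str) -> str:
        # decompose characters, then drop the combining marks
        return "".join(
            c for c in unicodedata.normalize("NFD", text)
            if not unicodedata.combining(c)
        )

    print(fold("vähä"))            # 'vaha' -- now indistinguishable from "wax"
    print(fold("vähä") == "vaha")  # True: two different words, one index term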

philippemnoel•1mo ago
That's true. For this reason, most modern search engines support language-aware stemming and tokenization. Popular tokenizers for CJK languages include Lindera and Jieba.

We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...
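For anyone who wants to poke at this from Python: the same Snowball algorithms are also exposed by NLTK (assuming nltk is installed; this is not ParadeDB's code, just an easy way to see what a Finnish- or Danish-aware stemmer does with the examples above):

    from nltk.stem.snowball import SnowballStemmer

    fi = SnowballStemmer("finnish")
    da = SnowballStemmer("danish")

    for word in ["talo", "talosta", "talostani", "talostaniko"]:
        print(word, "->", fi.stem(word))

    print("huset", "->", da.stem("huset"))  # "the house" in Danish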

ashirviskas•1mo ago
Yep and I find that this really worsens LLM performance. For example `Ben,Alice` would be tokenized as `Ben|,A|lice`. And having to connect `lice` to the name `Alice` does not make it any easier for LLMs. However, formatting it as `Ben, Alice` tokenizes it as `Ben|,| Alice`. I found it kind of useful to improve performance by just formatting the data a bit differently.

I actually just started working on a data formatter that applies principles like these to drastically reduce the number of tokens without hurting performance the way other formats do (looking at you, tson).
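If you want to check the boundary effect on your own data, a quick sketch (assumes tiktoken is installed; the exact splits depend on which encoding your model uses):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["Ben,Alice", "Ben, Alice"]:
        pieces = [enc.decode([t]) for t in enc.encode(text)]
        print(text, "->", "|".join(pieces))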

nawazgafar•1mo ago
You beat me to the punch. I wrote a blog post[1] with the exact same title last week! Though, I went into a bit more detail with regard to embedding layers, so maybe my title is not accurate.

1. https://gafar.org/blog/text-to-tokens

jamesgresql•1mo ago
Amazing, will have a read!
flakiness•1mo ago
Oh, it's good old tokenization vs. for-LLM tokenization like SentencePiece or tiktoken. We shouldn't forget there are simple non-ML things like this that don't ask you to buy more GPUs.
jamesgresql•1mo ago
Haha, I like “good old tokenization”
6r17•1mo ago
I'm wondering if the English stopwords are not children of a declension that was forgotten from the language - ... ok so I had to check this out, but I don't really have time to check more than with Gemini - apparently the word "the" is basically the sole survivor of a massive, complex table of declensions. In Old English you could not just say "the". You had to choose the correct word based on gender, case, and number - exactly like you do in Polish today with ten, ta, to, tego, temu, tej, etc.

The Old English "the" (definite article):

    Case           Masculine (ten)   Neuter (to)   Feminine (ta)   Plural (te)
    Nominative     Se                Þæt           Sēo             Þā
    Accusative     Þone              Þæt           Þā              Þā
    Genitive       Þæs               Þæs           Þære            Þāra
    Dative         Þæm               Þæm           Þære            Þæm
    Instrumental   Þy                Þy            —               —

I have read somewhere that Polish was actually a more precise language to use with AI - I'm wondering if the idea of stripping out words that apparently make no sense isn't actually hurting it more - as noted by the article, though.

So I'm left to wonder at this point - wouldn't it be worth exploring a terser version of the language that might bridge that gap? Completely exploratory though; I don't even know if that would be a helpful idea beyond being a toy.

zk0•1mo ago
tl;dr with a match statement
anonymoushn•1mo ago
i love stemming, i love searching for "anime" and getting "animal"