From text to token: How tokenization pipelines work

https://www.paradedb.com/blog/when-tokenization-becomes-token
134•philippemnoel•1mo ago

Comments

wongarsu•1mo ago
Notably, this is tokenization for traditional search. LLMs use very different tokenization with very different goals.
tgv•1mo ago
It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with a richer morphology, or compounding. It's a very "English" approach.
jamesgresql•1mo ago
Chinese, Japanese, Korean, etc. don’t work like this either.

However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate?

At the end of the day people choose a tokenizer that matches their language.

I will update the article to make all this clearer though!

empiko•1mo ago
This was common even in 2015. You can still see people removing stop words from text, even when they feed it to LLMs. It's of course terrible for performance, but old habits die hard I guess.
jamesgresql•1mo ago
100%, maybe we should do a follow up on other types of tokenization.
semicognitive•1mo ago
ParadeDB is a great team, highly recommend using it.
the_arun•1mo ago
Just curious: if we remove stop words from prompts before sending them to an LLM, wouldn't that reduce the token count? Would the LLM's response stay the same (original vs. without stop words)?
kylecazar•1mo ago
Search engines can afford to throw out stopwords because they're often keyword based. But (frontier) LLMs need the nuance and semantics that stopwords signal, so they don't automatically strip them. There are probably special-purpose models that do this, or certain parts of a RAG pipeline, but that's the exception.

Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though. You're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is going to save you negligible $ and potentially cost a lot in model performance.
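
To make the ambiguity point concrete, here is a minimal sketch. The stop list below is hypothetical (real lists differ, though some do include negations such as "not"), and it counts whitespace-separated words rather than real LLM tokens, so treat the numbers as a rough proxy only.

```python
# Hypothetical stop list for illustration; real lists vary by library.
STOP_WORDS = {"a", "an", "and", "do", "is", "not", "of", "the", "to"}


def strip_stop_words(text: str) -> str:
    """Drop every whitespace-separated word that appears in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


original = "do not delete the backup"
opposite = "delete the backup"

stripped_original = strip_stop_words(original)
stripped_opposite = strip_stop_words(opposite)

# Both prompts collapse to the same string once stop words are removed.
print(stripped_original)
print(stripped_opposite)
print(len(original.split()), "words before,", len(stripped_original.split()), "after")
```

With this list, two prompts with opposite meanings become indistinguishable, which is harmless for keyword lookup but not for a model that has to follow the instruction.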

cubefox•1mo ago
Don't know, but GPT-5 Thinking strips out a lot of words in its reasoning trace in order to save tokens. Someone on Twitter jailbroke it in order to get the original CoT traces.
gortok•1mo ago
My biggest complaints about search come from day-to-day uses:

I use search in my email pretty heavily, and I’m most interested in specific words in the email, and in whether those emails are from specific folks or a specific domain. But the mobile version of Gmail, the mobile Outlook app, and the desktop version of Gmail all produce different results, and all of them are pretty terrible at searching email.

I have a hard time getting them to pull up emails in search that I know exist, that I know have certain words, and that I know have certain email addresses in the body.

I recognize that a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email attachments that no one cares enough to try?

mattnewton•1mo ago
Huh, maybe your use case is around the indexing of the contents of attachments? I basically never search for the contents of attachments, just the bodies of emails, and have found Gmail search to be really good. I switched back to the web client from Mac’s native mail app for this reason because search has been so good for me in Gmail.

I haven’t looked, but I wonder if there is a good hackable email client that will let you substitute out the search index with a reasonable abstraction from all the complicated email protocol stuff. I feel like building an index for your use case is totally achievable if so.

heikkilevanto•1mo ago
Good explanation on tokenizing English text for regular search. But it is far from universal, and will not work well in Finnish, for example.

Folding diacritics makes "vähä" (little) into "vaha" (wax).

Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).

Stemming Finnish words is also much more complex, as we tend to append suffixes to the words instead of putting small words in front of the word. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" makes it a question "from my house?"

If that sounds too easy, consider Japanese. From what little I know they don't use whitespace to separate words, mix two phonetic alphabets with Chinese ideograms, etc.
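
The two Finnish points above can be reproduced with a short sketch. It assumes diacritic folding is done by NFD decomposition plus dropping combining marks, which is one common approach and not necessarily what any particular search engine does. The suffix list is a hypothetical three-entry toy covering only the "talo" example, not a real Finnish stemmer.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (one common folding approach)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy suffix list for the example only: question clitic, possessive, elative case.
TOY_SUFFIXES = ("ko", "ni", "sta")


def peel_suffixes(word: str) -> tuple[str, list[str]]:
    """Strip toy suffixes from the end, one at a time, until none match."""
    peeled = []
    changed = True
    while changed:
        changed = False
        for suffix in TOY_SUFFIXES:
            # Keep at least three characters of stem.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                peeled.append(suffix)
                changed = True
                break
    return word, peeled


print(fold_diacritics("vähä"))       # "little" collapses into the spelling of "wax"
print(peel_suffixes("talostaniko"))  # three stacked suffixes come off in reverse order
```

Even this toy needs a loop, because the suffixes stack; a real Finnish stemmer also has to handle consonant gradation and many more endings.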

philippemnoel•1mo ago
That's true. For this reason, most modern search engines support language-aware stemming and tokenization. Popular tokenizers for CJK languages include Lindera and Jieba.

We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...

ashirviskas•1mo ago
Yep and I find that this really worsens LLM performance. For example `Ben,Alice` would be tokenized as `Ben|,A|lice`. And having to connect `lice` to the name `Alice` does not make it any easier for LLMs. However, formatting it as `Ben, Alice` tokenizes it as `Ben|,| Alice`. I found it kind of useful to improve performance by just formatting the data a bit differently.

I actually just started working on a data formatter that applies principles like these to drastically reduce the amount of tokens without decreasing the performance, like other formats do (looking at you, tson).

nawazgafar•1mo ago
You beat me to the punch. I wrote a blog post[1] with the exact same title last week! Though, I went into a bit more detail with regard to embedding layers, so maybe my title is not accurate.

1. https://gafar.org/blog/text-to-tokens

jamesgresql•1mo ago
Amazing, will have a read!
flakiness•1mo ago
Oh, it's good old tokenization vs. for-LLM tokenization like SentencePiece or tiktoken. We shouldn't forget there are simple non-ML things like this one that don't ask you to buy more GPUs.
jamesgresql•1mo ago
Haha, I like “good old tokenization”
6r17•1mo ago
I'm wondering whether the English stopwords are the remnants of a declension system that has since disappeared from the language. I had to check this, but I didn't have time to check with anything more than Gemini. Apparently, the word "the" is basically the sole survivor of a massive, complex table of declensions. In Old English, you could not just say "the." You had to choose the correct word based on gender, case, and number, exactly like you do in Polish today with ten, ta, to, tego, temu, tej, etc.

The Old English "The" (Definite Article)

Case          Masculine (Ten)  Neuter (To)  Feminine (Ta)  Plural (Te)
Nominative    Se               Þæt          Sēo            Þā
Accusative    Þone             Þæt          Þā             Þā
Genitive      Þæs              Þæs          Þære           Þāra
Dative        Þæm              Þæm          Þære           Þæm
Instrumental  Þy               Þy           —              —

I have read somewhere that Polish is actually a more precise language to use with AI. I'm wondering whether dropping words that apparently carry no meaning actually hurts it more, as the article also notes.

So at this point I wonder: wouldn't it be worth exploring a tenser version of the language that might bridge that gap? This is completely exploratory, though; I don't even know whether the idea would be useful beyond being a toy.

zk0•1mo ago
tl;dr with a match statement
anonymoushn•1mo ago
i love stemming, i love searching for "anime" and getting "animal"
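
The collision joked about above is easy to reproduce with a toy suffix stripper. This is a hypothetical four-rule sketch, not the Porter or Snowball algorithm, and real stemmers apply many more conditions; it only shows how two unrelated words can end up sharing one index term.

```python
# Toy suffix rules, tried in order; purely illustrative.
TOY_SUFFIXES = ("al", "es", "e", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for term in ("anime", "animal"):
    print(term, "->", toy_stem(term))

# A search index keyed on stems cannot tell the two words apart.
print(toy_stem("anime") == toy_stem("animal"))
```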