
This is not the future

https://blog.mathieui.net/this-is-not-the-future.html
190•ericdanielski•1h ago•68 comments

40 percent of fMRI signals do not correspond to actual brain activity

https://www.tum.de/en/news-and-events/all-news/press-releases/details/40-percent-of-mri-signals-d...
108•geox•1h ago•39 comments

Rust GCC back end: Why and how

https://blog.guillaume-gomez.fr/articles/2025-12-15+Rust+GCC+backend%3A+Why+and+how
41•ahlCVA•1h ago•13 comments

Full Unicode Search at 50× ICU Speed with AVX‑512

https://ashvardanian.com/posts/search-utf8/
77•ashvardanian•22h ago•37 comments

I don't think Lindley's paradox supports p-circling

https://vilgot-huhn.github.io/mywebsite/posts/20251206_p_circle_lindley/
15•speckx•1h ago•1 comment

You're overspending because you lack values

https://www.sherryning.com/p/youre-overspending-because-you-lack-values
32•speckx•1h ago•10 comments

Put a ring on it: a lock-free MPMC ring buffer

https://h4x0r.org/ring/
30•signa11•1h ago•12 comments

SHARP, an approach to photorealistic view synthesis from a single image

https://apple.github.io/ml-sharp/
405•dvrp•11h ago•93 comments

A2UI: A Protocol for Agent-Driven Interfaces

https://a2ui.org/
96•makeramen•5h ago•28 comments

Sega Channel: VGHF Recovers over 100 Sega Channel ROMs (and More)

https://gamehistory.org/segachannel/
23•wicket•2h ago•1 comment

Children with cancer scammed out of millions fundraised for their treatment

https://www.bbc.com/news/articles/ckgz318y8elo
411•1659447091•8h ago•330 comments

Cekura (YC F24) Is Hiring

https://www.ycombinator.com/companies/cekura-ai/jobs/YFeQADI-product-engineer-us
1•atarus•3h ago

Be Careful with GIDs in Rails

https://blog.julik.nl/2025/12/a-trap-with-global-ids
23•julik•5d ago•11 comments

Quill OS: An open-source OS for Kobo's eReaders

https://quill-os.org/
358•Curiositry•14h ago•116 comments

Bonsai: A Voxel Engine, from scratch

https://github.com/scallyw4g/bonsai
138•jesse__•9h ago•24 comments

ArkhamMirror: Airgapped investigation platform with CIA-style hypothesis testing

https://github.com/mantisfury/ArkhamMirror
58•ArkhamMirror•5h ago•26 comments

A brief history of Times New Roman

https://typographyforlawyers.com/a-brief-history-of-times-new-roman.html
15•tosh•1h ago•2 comments

Purrtran – ᓚᘏᗢ – A Programming Language for Cat People

https://github.com/cmontella/purrtran
19•simonpure•2d ago•2 comments

High Performance SSH/SCP

https://www.psc.edu/hpn-ssh-home/
47•gslin•5d ago•22 comments

Mozilla's new CEO is doubling down on an AI future for Firefox

https://www.theverge.com/tech/845216/mozilla-ceo-anthony-enzor-demeo
10•latexr•27m ago•8 comments

A linear-time alternative for Dimensionality Reduction and fast visualisation

https://medium.com/@roman.f/a-linear-time-alternative-to-t-sne-for-dimensionality-reduction-and-f...
86•romanfll•8h ago•28 comments

Erdős Problem #126

https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/
126•tzury•10h ago•18 comments

“Are you the one?” is free money

https://blog.owenlacey.dev/posts/are-you-the-one-is-free-money/
406•samwho•4d ago•99 comments

Internal RFCs saved us months of wasted work

https://highimpactengineering.substack.com/p/the-illusion-of-shared-understanding
72•romannikolaev•5d ago•50 comments

8M users' AI conversations sold for profit by "privacy" extensions

https://www.koi.ai/blog/urban-vpn-browser-extension-ai-conversations-data-collection
647•takira•12h ago•209 comments

Creating C closures from Lua closures

https://lowkpro.com/blog/creating-c-closures-from-lua-closures.html
47•publicdebates•4d ago•11 comments

Mathematicians Crack a Fractal Conjecture on Chaos

https://www.scientificamerican.com/article/mathematicians-crack-a-fractal-conjecture-on-chaos/
4•mikhael•6d ago•3 comments

Native vs. emulation: World of Warcraft game performance on Snapdragon X Elite

https://rkblog.dev/posts/pc-hardware/pc-on-arm/x86_versus_arm_native_game/
92•geekman7473•15h ago•42 comments

Show HN: I designed my own 3D printer motherboard

https://github.com/KaiPereira/Cheetah-MX4-Mini
98•kaipereira•1w ago•26 comments

Economics of Orbital vs. Terrestrial Data Centers

https://andrewmccalip.com/space-datacenters
155•flinner•17h ago•208 comments

Full Unicode Search at 50× ICU Speed with AVX‑512

https://ashvardanian.com/posts/search-utf8/
76•ashvardanian•22h ago

Comments

andersa•2h ago
From a German user's perspective, ICU and your fancy library are actually incorrect. Mass is not a different casing of Maß; they are different words. Google likely changed this because it didn't do what users wanted.
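For concreteness, Python's str.casefold implements essentially the same locale-independent Unicode default case folding that ICU applies out of the box, so it reproduces the behavior being objected to here:

    # Unicode default (locale-independent) full case folding maps ß to "ss",
    # so a folding-based case-insensitive search treats Maß and Mass as equal:
    print("Maß".casefold())                       # -> 'mass'
    print("Maß".casefold() == "Mass".casefold())  # -> True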
b2ccb2•2h ago
The confusion likely stems from the relatively recent introduction of the capital ẞ: https://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F

Capitalized, Maß used to be MASS.

Funnily enough, Mass means a liter of beer (think Oktoberfest).

andersa•1h ago
It's strange, because I would expect "maß" as the case insensitive search to match "MASS" in the search text, but "mass" should not match "Maß".
looperhacks•1h ago
Both Maß and Mass are valid spellings for a liter of beer ;) Not to be confused with Maß, which just means any measurement, of course.
looperhacks•1h ago
MASS is an allowed casing of Maß, but not the preferred one: https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk... (page 48)
pjmlp•1h ago
It isn't until it is: how would you write it when ß isn't available on the keyboard?

Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.

And German speakers from the de-CH region consider Mass the correct spelling.

Yeah, localization and internationalization are a mess to get right.

wat10000•1h ago
Case insensitivity is localized like anything else. I and i are equivalent, right? Not if you’re doing Turkish; there it’s I and ı, and İ and i.

In practice you can do pretty well with a universal approach, but it can’t be 100% correct.
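Python's str methods implement exactly these locale-independent default mappings, which makes the Turkish gap easy to demonstrate:

    # The Unicode default case mappings lowercase I to i; Turkish expects ı.
    print("I".lower())       # -> 'i' (wrong for Turkish)
    # İ (U+0130) has no single-code-point default lowercase: it maps to
    # 'i' plus U+0307 COMBINING DOT ABOVE, so the result is two code points.
    print(len("İ".lower()))  # -> 2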

ashvardanian•56m ago
This is a very good example! Still, “correct” needs context. You can be 100% “correct with respect to ICU”. It’s definitely not perfect, but it’s the best standard we have. And luckily for me, it also defines the locale-independent rules. I can expand to support locale-specific adjustments in the future, but I’m waiting for adoption to grow before investing even more engineering effort in this feature. Maybe worth opening a GitHub issue for that :)
wat10000•49m ago
Right, nothing wrong with delegating the decision to a bunch of people who have thought long and hard about the best compromise, as long as it’s understood that it’s not perfect.
Arnt•1h ago
Ah, let's have a long discussion of this.

Unicode avoids “different” and “same”; https://www.unicode.org/reports/tr15/ uses phrases like “compatibility equivalence”.

The whole thing is complicated, because it actually is complicated in the real world. You can spell the name of Gießen "Giessen" and most Germans consider it correct even if not ideal, but spelling Massachusetts "Maßachusetts" is plainly wrong in German text. The relationship between ß and ss isn't symmetric. Unicode captures that complexity, when you get into the fine details.

mxmlnkn•1h ago
I never understood why the recommended replacement for ß is ss. It is a ligature of sz (similar to & being a ligature of et) and is even pronounced ess-zet. The only logical replacement would have been sz, and it would have avoided the clash of Masse (mass) and Maße (measurements). Then again, it only affects whether the vowel before it is pronounced short or long, and there are better ways to encode that in written language in the first place.
mgaunard•2h ago
In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough, this library doesn't provide grapheme cluster tokenization and/or boundary checking, which is one of the most useful primitives for this.
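A sketch of the normalize-first approach mgaunard describes, using only Python's stdlib (the thread below debates its cost):

    import unicodedata

    def nfc_find(haystack: str, needle: str) -> int:
        # Normalize both sides up front; after that, exact substring search
        # is a plain code-point comparison, the memcmp analogue. Note that
        # the returned offset is into the normalized haystack; a real
        # implementation would pair this with the boundary check mentioned above.
        return unicodedata.normalize("NFC", haystack).find(
            unicodedata.normalize("NFC", needle))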

stingraycharles•2h ago
That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes of course, but that’s hardly ever the case, and different tradeoffs are made at different times (e.g., storage in UTF-8 for space efficiency, but an in-memory representation in UTF-32 for speed).

mgaunard•2h ago
That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.
stingraycharles•1h ago
It can, because of how CPUs work with registers and hot code paths and all that.

First normalizing everything and then comparing normalized versions isn’t as fast.

And it also enables “stopping early” once a match has been found or ruled out: you may not actually have to convert everything.

mgaunard•51m ago
Running more code per unit of data does not make the code hotter or reduce the register pressure, quite the opposite...
stingraycharles•44m ago
You’re misunderstanding: you just convert to 32 bits once and reuse that same register all the time.

You’re running the exact same code, but you’re more efficient in the sense of “I immediately use the data for comparison after converting it”, which means it’s likely already in a register or in the L1 cache.

ashvardanian•1h ago
I get why it sounds that way, but it’s not actually true.

StringZilla added full Unicode case folding in an earlier release, and had a state-of-the-art exact case-sensitive substring search for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.

The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.

I go into the details in the “Ideation & Challenges in Substring Search” section of the article.
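A scalar sketch of that filter-then-verify shape, in Python for brevity (an illustration of the control flow only, not StringZilla's actual AVX-512 kernel):

    def casefold_find(haystack: str, needle: str) -> int:
        folded_needle = needle.casefold()
        if not folded_needle:
            return 0
        first = folded_needle[0]
        for i, ch in enumerate(haystack):
            # Cheap probe: reject positions whose fold can't start a match.
            # (The real kernels do this with SIMD byte comparisons.)
            if ch.casefold()[0] != first:
                continue
            # Expensive verify, only on surviving candidates: fold a small
            # window, never the whole haystack. Folds never shrink, so
            # len(folded_needle) original characters always suffice.
            window = haystack[i : i + len(folded_needle)]
            if window.casefold().startswith(folded_needle):
                return i
        return -1

    print(casefold_find("Der Maßkrug wartet", "MASSKRUG"))  # -> 4

Most haystack positions fail the one-character probe, so the fold logic runs on only a small subset of candidates, which is the point being made above.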

Const-me•1h ago
> it cannot be faster than normalization alone

Modern processors are generally computing stuff way faster than they can load and store bytes from main memory.

The code which does on-the-fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single-cycle access latency and ridiculously high throughput, like 500GB/sec. Even if you have to store and reload, on-the-fly normalization is likely to handle tiny windows which fit in the in-core L1D cache. The access cost for L1D is ~5 cycles of latency, with similarly high throughput, because many modern processors can load two 64-byte vectors and store one vector each and every cycle.

mgaunard•49m ago
The author published the bandwidth of his algo; it's one fifth of typical memory bandwidth (obviously it's not possible to go faster than memory for this benchmark, since we're assuming the data is not in cache).
orthoxerox•2h ago
In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.
orthoxerox•1h ago
Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?
ashvardanian•1h ago
Yes — fuzzy and phonetic matching across languages is part of the roadmap. That space is still poorly standardized, so I wanted to start with something widely understood and well-defined (ICU-style transforms) before layering on more advanced behavior.

Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. It’s not a true bicameral script, but some of its characters are fold targets for older scripts, which currently forces too many fallbacks to the serial path.

unwind•1h ago
Very cool and impressive performance.

I was worried by the article's use of U+212A (the Kelvin symbol) as sample text, so I had to look it up [1]. I find it confusing when these Unicode "shadows" of normal letters exist, and they are of course also dangerous in some cases, when they can be misinterpreted as the letter they look more or less exactly like.

Anyway, according to Wikipedia, the dedicated symbol should not be used:

> However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.

That was comforting, to me. :)

[1]: https://en.wikipedia.org/wiki/Kelvin#Orthography

jjmarr•1h ago
> I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like

Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.

ComputerGuru•28m ago
Normalization wouldn’t address this.
happytoexplain•19m ago
What do you mean? All four normal forms of the Kelvin 'K' are the Latin 'K', as far as I can tell.
nwellnhof•15m ago
Normalization forms NFKC and NFKD, which also handle compatibility equivalence, do.
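Both observations check out against Python's unicodedata: U+212A has a singleton canonical decomposition to the Latin K, so every normal form maps it there; the NFK* forms are only needed for compatibility-only characters like the fi ligature:

    import unicodedata

    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        kelvin = unicodedata.normalize(form, "\u212A") == "K"  # True for all four
        fi = unicodedata.normalize(form, "\uFB01") == "fi"     # True only for NFK*
        print(form, kelvin, fi)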
xking5205•1h ago
It's good.
ashvardanian•1h ago
This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.

StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.

Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.

fatty_patty89•1h ago
Thank you.

Do the Go bindings require cgo?

ashvardanian•1h ago
The Go bindings – yes, they are based on cgo. I realize it's suboptimal, but it seems like the only practical option at this point.
fatty_patty89•1h ago
In a normal world the Go C FFI wouldn't have insane overhead, but what can we do; the language is perfect, and it will stay that way until morale improves.

Thanks for the work you do

kardianos•29m ago
In a real (not "normal") world, trade-offs exist, and Go chose a specific set of design points that are consequential.
adzm•49m ago
This is a truly amazing accomplishment. Reading these kernels is a joy!
kardianos•21m ago
Looks neat. What are all the genomic sequence comparisons in there for? Is this a grab bag of interesting string methods or is there a motivation for this?
ashvardanian•18m ago
Levenshtein distance calculations are a pretty generic string operation; genomics happens to be one of the domains where they are most used... and a passion of mine :)
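For readers wondering what that operation is: Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A textbook dynamic program (StringZilla's SIMD implementation is of course far more elaborate):

    def levenshtein(a: str, b: str) -> int:
        # prev[j] holds the distance between a[:i-1] and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("GATTACA", "GCATGCU"))  # -> 4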