Interestingly enough this library doesn't provide grapheme cluster tokenization and/or boundary checking which is one of the most useful primitive for this.
If you’re in control of all data representations in your entire stack, then yes of course, but that’s hardly ever the case and different tradeoffs are made at different times (eg storage in UTF-8 because of efficiency, but in-memory representation in UTF-32 because of speed).
First normalizing everything and then comparing normalized versions isn’t as fast.
And it also enables “stopping early” when a match has been found / not found, you may not actually have to convert everything.
You’re running the exact same code, but are more more efficient in terms of “I immediately use the data for comparison after converting it”, which means it’s likely either in a register or L1 cache already.
StringZilla added full Unicode case folding in an earlier release, and had a state-of-the-art exact case-sensitive substring search for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.
The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.
I go into the details in the “Ideation & Challenges in Substring Search” section of the article
Modern processors are generally computing stuff way faster than they can load and store bytes from main memory.
The code which does on the fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single CPU cycle access latency and ridiculously high throughput like 500GB/sec. Even if you have to store and reload, on-the-fly normalization is likely to handle tiny windows which fit in the in-core L1D cache. The access cost for L1D is like ~5 cycles of latency, and equally high throughput because many modern processors can load two 64-bytes vectors and store one vector each and every cycle.
Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. It’s not a true bicameral script, but some characters are folding fold targets for older scripts, which currently forces too many fallbacks to the serial path.
I was worried (I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like) by the article's use of U+212A (Kelvin symbol) as sample text, so I had to look it up [1].
Anyway, according to Wikipedia the dedicated symbol should not be used:
However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.
That was comforting, to me. :)
Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.
I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.
StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.
Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.
do the go bindings require cgo?
Thanks for the work you do
andersa•2h ago
b2ccb2•2h ago
Maß capitalized (used to be) MASS.
Funnily enough, Mass means one liter beer (think Oktoberfest).
andersa•1h ago
looperhacks•1h ago
looperhacks•1h ago
pjmlp•1h ago
Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.
Then German language users from de-CH region, consider Mass the correct way.
Yeah, localization and internalization is a mess to get right.
wat10000•1h ago
In practice you can do pretty well with a universal approach, but it can’t be 100% correct.
ashvardanian•56m ago
wat10000•49m ago
Arnt•1h ago
Unicode avoids "different" and "same", https://www.unicode.org/reports/tr15/ uses phrases like compatibility equivalence.
The whole thing is complicated, because it actually is complicated in the real world. You can spell the name of Gießen "Giessen" and most Germans consider it correct even if not ideal, but spelling Massachusetts "Maßachusetts" is plainly wrong in German text. The relationship between ß and ss isn't symmetric. Unicode captures that complexity, when you get into the fine details.
mxmlnkn•1h ago