frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
193•theblazehen•2d ago•56 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
679•klaussilveira•14h ago•203 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
954•xnx•20h ago•552 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
125•matheusalmeida•2d ago•33 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
25•kaonwarb•3d ago•21 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
62•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
235•isitcontent•15h ago•25 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
39•jesperordrup•5h ago•17 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
227•dmpetrov•15h ago•121 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
332•vecti•17h ago•145 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
499•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
384•ostacke•21h ago•96 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
360•aktau•21h ago•183 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
292•eljojo•17h ago•182 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
21•speckx•3d ago•10 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
413•lstoll•21h ago•279 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
6•matt_d•3d ago•1 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
20•bikenaga•3d ago•10 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
66•kmm•5d ago•9 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
93•quibono•4d ago•22 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
260•i5heu•17h ago•202 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
33•romes•4d ago•3 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
38•gmays•10h ago•13 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1073•cdrnsf•1d ago•459 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
60•gfortaine•12h ago•26 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
291•surprisetalk•3d ago•43 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•71 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
8•1vuio0pswjnm7•1h ago•0 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
154•SerCe•10h ago•144 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
187•limoce•3d ago•102 comments
Open in hackernews

Full Unicode Search at 50× ICU Speed with AVX‑512

https://ashvardanian.com/posts/search-utf8/
217•ashvardanian•1mo ago

Comments

andersa•1mo ago
From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß, they are different characters. Google likely changed this because it didn't do what users wanted.
b2ccb2•1mo ago
The confusion likely stems from the relatively new introduction of the capitalized ẞ https://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F

Maß capitalized (used to be) MASS.

Funnily enough, Mass means one liter beer (think Oktoberfest).

andersa•1mo ago
It's strange, because I would expect "maß" as the case insensitive search to match "MASS" in the search text, but "mass" should not match "Maß".
janc_•1mo ago
I think all of those should be "tentative matches" for each other.
looperhacks•1mo ago
Both Maß and Mass are valid spellings for a liter of beer ;) Not to confuse it with Maß, which just means any measurement, of course.
looperhacks•1mo ago
MASS is allowed casing of Maß, but not the preferred casing: https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk... Page 48
pjmlp•1mo ago
It isn't until it is, how would you write it when ß isn't available on the keyboard?

Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.

Then German language users from de-CH region, consider Mass the correct way.

Yeah, localization and internalization is a mess to get right.

wat10000•1mo ago
Case insensitivity is localized like anything else. I and i are equivalent, right? Not if you’re doing Turkish, then it’s I and ı, and İ and i.

In practice you can do pretty well with a universal approach, but it can’t be 100% correct.

ashvardanian•1mo ago
This is a very good example! Still, “correct” needs context. You can be 100% “correct with respect to ICU”. It’s definitely not perfect, but it’s the best standard we have. And luckily for me, it also defines the locale-independent rules. I can expand to support locale-specific adjustments in the future, but waiting for the adoption to grow before investing even more engineering effort into this feature. Maybe worth opening a GitHub issue for that :)
wat10000•1mo ago
Right, nothing wrong with delegating the decision to a bunch of people who have thought long and hard about the best compromise, as long as it’s understood that it’s not perfect.
Arnt•1mo ago
Ah, let's have a long discussion of this.

Unicode avoids "different" and "same", https://www.unicode.org/reports/tr15/ uses phrases like compatibility equivalence.

The whole thing is complicated, because it actually is complicated in the real world. You can spell the name of Gießen "Giessen" and most Germans consider it correct even if not ideal, but spelling Massachusetts "Maßachusetts" is plainly wrong in German text. The relationship between ß and ss isn't symmetric. Unicode captures that complexity, when you get into the fine details.

mxmlnkn•1mo ago
I never understood why the recommended replacement for ß is ss. It is a ligature of sz (similar to & being a ligature of et) and is even pronounced ess-zet. The only logical replacement would have been sz, and it would have avoided the clash of Masse (mass) and Maße (measurements). Then again, it only affects whether the vowel before it is pronounced short or long, and there are better ways to encode that in written language in the first place.
adrian_b•1mo ago
I agree that writing it "sz" might have created less problems.

However, it is likely that it has never been pronounced "sz", but always "ss" and the habit of writing "sz" for the double consonant may have had the same reason as the writing of "ck" instead of the double "kk".

mgaunard•1mo ago
In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough this library doesn't provide grapheme cluster tokenization and/or boundary checking which is one of the most useful primitive for this.

stingraycharles•1mo ago
That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes of course, but that’s hardly ever the case and different tradeoffs are made at different times (eg storage in UTF-8 because of efficiency, but in-memory representation in UTF-32 because of speed).

mgaunard•1mo ago
That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.
stingraycharles•1mo ago
It can, because of how CPUs work with registers and hot code paths and all that.

First normalizing everything and then comparing normalized versions isn’t as fast.

And it also enables “stopping early” when a match has been found / not found, you may not actually have to convert everything.

mgaunard•1mo ago
Running more code per unit of data does not make the code hotter or reduce the register pressure, quite the opposite...
stingraycharles•1mo ago
You’re misunderstanding: you just convert to 32 bits once and reuse that same register all the time.

You’re running the exact same code, but are more more efficient in terms of “I immediately use the data for comparison after converting it”, which means it’s likely either in a register or L1 cache already.

ashvardanian•1mo ago
I get why it sounds that way, but it’s not actually true.

StringZilla added full Unicode case folding in an earlier release, and had a state-of-the-art exact case-sensitive substring search for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.

The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.

I go into the details in the “Ideation & Challenges in Substring Search” section of the article

Const-me•1mo ago
> it cannot be faster than normalization alone

Modern processors are generally computing stuff way faster than they can load and store bytes from main memory.

The code which does on the fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single CPU cycle access latency and ridiculously high throughput like 500GB/sec. Even if you have to store and reload, on-the-fly normalization is likely to handle tiny windows which fit in the in-core L1D cache. The access cost for L1D is like ~5 cycles of latency, and equally high throughput because many modern processors can load two 64-bytes vectors and store one vector each and every cycle.

mgaunard•1mo ago
The author published the bandwidth of its algo, it's one fifth of a typical memory bandwidth (it's not possible to go faster than memory obviously for this benchmark, since we're assuming the data is not in cache).
orthoxerox•1mo ago
In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.
orthoxerox•1mo ago
Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?
ashvardanian•1mo ago
Yes — fuzzy and phonetic matching across languages is part of the roadmap. That space is still poorly standardized, so I wanted to start with something widely understood and well-defined (ICU-style transforms) before layering on more advanced behavior.

Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. It’s not a true bicameral script, but some characters are folding fold targets for older scripts, which currently forces too many fallbacks to the serial path.

janc_•1mo ago
Even when transliteration is somewhat de-facto standardised, it usually is dependent on the target/host language. So e.g. Arabic & Russian are transliterated differently in e.g. English, French, German, Dutch, etc.
unwind•1mo ago
Very cool and impressive performance.

I was worried (I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like) by the article's use of U+212A (Kelvin symbol) as sample text, so I had to look it up [1].

Anyway, according to Wikipedia the dedicated symbol should not be used:

However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.

That was comforting, to me. :)

[1]: https://en.wikipedia.org/wiki/Kelvin#Orthography

jjmarr•1mo ago
> I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like

Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.

ComputerGuru•1mo ago
Normalization wouldn’t address this.
happytoexplain•1mo ago
What do you mean? All four normal forms of the Kelvin 'K' are the Latin 'K', as far as I can tell.
nwellnhof•1mo ago
Normalization forms NFKC and NFKD that also handle compatibility equivalence do.
mananaysiempre•1mo ago
A few deprecated characters, including the Kelvin and Ångström symbols, are in fact canonically equivalent to their replacements and not just compatibility equivalent, so plain NFC/NFD is enough. (It’s generally better to avoid NFKC/NFKD normalizations unless you fully understand the implications, as they do lose meaning and at the same time do not account for all possible confusables.)
Sesse__•1mo ago
It's why the Unicode Collation Algorithm exists.

If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:

  004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
  212A  ; [.2514.0020.0008] # KELVIN SIGN
The numbers in the brackets are values on level 1 (base), level 2 (typically used for accents), level 3 (typically used for case). So they are to compare identical under the UCA, in almost every case except for if you really need a tiebreaker.

Compare e.g. :

  1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K
which would compare equal to those under a case-insensitive accent-sensitive collation, but _not_a case-sensitive one (case-sensitive collations are always accent-sensitive, too).
happytoexplain•1mo ago
Are the meanings for the levels for each code point defined somewhere (accent, casing, etc)?
Sesse__•1mo ago
Typically it is defined by the collation. For the default collation, where all the weights are as in the file, it's none/accent/accent+case. But if you go to e.g. Japanese, you can have a fourth level of “kana-sensitive” (which distinguishes between e.g. katakana and hiragana).
xking5205•1mo ago
its good
ashvardanian•1mo ago
This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.

StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.

Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.

fatty_patty89•1mo ago
Thank you

do the go bindings require cgo?

ashvardanian•1mo ago
The GoLang bindings – yes, they are based on cGo. I realize it's suboptimal, but seems like the only practical option at this point.
fatty_patty89•1mo ago
In a normal world the Go C FFI wouldn't have insane overhead but what can we do, the language is perfect and it will stay that way until morale improves.

Thanks for the work you do

kardianos•1mo ago
In a real (not "normal") world, trade-offs exist and Go choose a specific set of design points that are consequential.
kbolino•1mo ago
There are undoubtedly still some optimizations lying around, but the biggest source of Go's FFI overhead is goroutines.

There's only two "easy" solutions I can see: switch to N:N threading model or make the C code goroutine-aware. The former would speed up C calls at the expense of slowing down lots of ordinary Go code. Personally, I can still see some scenarios where that's beneficial, but it's pretty niche. The latter would greatly complicate the use of cgo, and defeat one of its core purposes, namely having access to large hard-to-translate C codebases without requiring extensive modifications of them.

A lot of people compare Go's FFI overhead to that of other natively compiled languages, like Zig or Rust, or to managed runtime languages like Java (JVM) or C# (.NET), but those alternatives don't use green threads (the general concept behind goroutines) as extensively. If you really want to compare apples-to-apples, you should compare against Erlang (BEAM). As far as I can tell, Erlang NIFs [1] are broadly similar to purego [2] calls, and their runtime performance [3] has more or less the same issues as CGo [4].

[1]: https://www.erlang.org/doc/system/nif.html

[2]: https://pkg.go.dev/github.com/ebitengine/purego

[3]: https://erlang.org/documentation/doc-10.1/doc/efficiency_gui...

[4]: https://www.reddit.com/r/golang/comments/12nt2le/when_dealin...

fatty_patty89•1mo ago
Java has green threads and c#/.net has logical threads
kbolino•1mo ago
Yes, I have cleaned up the wording a bit. Also, the common implementation of Rust's async is comparable to green threads, and I think Zig is adopting something like it too.

However, the "normal" execution model on all of them is using heavyweight native threads, not green threads. As far as I can tell, FFI is either unsupported entirely or has the same kind of overhead as Go and Erlang do, when used from those languages' green threads.

fatty_patty89•1mo ago
Genuine question, you make it seem as this is a limitation and they're all in the same bucket but how was Java for example able to scale all the enterprises while having multi threading and good ffi, same with .net.

My impression is that the go ffi is with big overhead because of the specific choices made to not care about ffi because it would benefit the go code more?

My point was that there's other gc languages/envorionments that have good ffi and were somehow able all these decades to create scalable multithreaded applications.

kbolino•1mo ago
I would suggest gaining a better understanding of the M:N threading model versus the N:N threading model. I do not know that I can do it justice here.

Both Java and Rust flirted with green threads in their early days. Java abandoned them because the hardware wasn't ready yet, and Rust abandoned them because they require a heavyweight runtime that wasn't appropriate for many applications Rust was targeting. And yet, both languages (and others besides) ended up adding something like them in later anyway, albeit sitting beside, rather than replacing, the traditional N:N threading they primarily support.

Your question might just be misdirected; one could view it as operating systems, and not programming languages per se, that screwed it all up. Their threads, which were conservatively designed to be as compatible as possible with existing code, have too much overhead for many tasks. They were good enough for awhile, especially as multicore systems started to enter the scene, but their limitations became apparent after e.g. nginx could handle 10x the requests of Apache httpd on the same hardware. This gap would eventually be narrowed, to some extent, but it required a significant amount of rework in Apache.

If you can answer the question of why ThreadPoolExecutor exists in Java, then you are about halfway to answering the question about why M:N threading exists. The other half is mostly ergonomics; ThreadPoolExecutor is great for fanning out pieces of a single, subdividable task, but it isn't great for handling a perpetual stream of unrelated tasks that ebb and flow over time. EDIT: See the Project Loom proposal for green threads in Java today, which also brings up the ForkJoinPool, another approach to M:N threading: https://cr.openjdk.org/~rpressler/loom/Loom-Proposal.html

adzm•1mo ago
This is a truly amazing accomplishment. Reading these kernels is a joy!
dboon•1mo ago
This is exactly the kind of thankless software which the world operates on. It’s unfortunate that such fundamental code hasn’t already been vectorized or the gills, but thank you for doing so! It’s excellent work
Sesse__•1mo ago
> I grouped all Unicode 17 case-folding rules

But why are you using the case-folding rules and not the collation rules?

ashvardanian•1mo ago
Yes, CaseFolding.txt. I'm considering using the collation rules for sorting. Now they only target lexicographic comparisons and seem to be 4x faster than Rust's standard quick-sort implementation, but few people use it: https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...
zvr•1mo ago
Thank you for this, and congrats on your achievement!
kardianos•1mo ago
Looks neat. What are all the genomic sequence comparisons in there for? Is this a grab bag of interesting string methods or is there a motivation for this?
ashvardanian•1mo ago
Levenshtein distance calculations are a pretty generic string operation, Genomics happens to be one of the domains where they are most used... and a passion of mine :)
moralestapia•1mo ago
Ash is amazing!

Also very cool and approachable guy.

(Best wishes if you're reading this.)

hans_castorp•1mo ago
Would be cool if that could be integrated into Postgres :)
ashvardanian•1mo ago
I was just about to ask some friends about it. If I’m not mistaken, Postgres began using ICU for collation, but not string matching yet. Curious if someone here is working in that direction?
bbminner•1mo ago
I was really confused about the case folding, this page explained the motivation well https://jean.abou-samra.fr/blog/unicode-misconceptions

""" Continuing with the previous example of “ß”, one has lowercase("ss") != lowercase("ß") but uppercase("ss") == uppercase("ß"). Conversely, for legacy reasons (compatibility with encodings predating Unicode), there exists a Kelvin sign “K”, which is distinct from the Latin uppercase letter “K”, but also lowercases to the normal Latin lowercase letter “k”, so that uppercase("K") != uppercase("K") but lowercase("K") == lowercase("K").

The correct way is to use Unicode case folding, a form of normalization designed specifically for case-insensitive comparisons. Both casefold("ß") == casefold("ss") and casefold("K") == casefold("K") are true. Case folding usually yields the same result as lowercasing, but not always (e.g., “ß” lowercases to itself but case-folds to “ss”). """

One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols? To make quantified machine readable (oh, this is not a 100K license plate or money amount, but a temperature)? Or to make it easier for specialized software to display it in correct placed/units?

bee_rider•1mo ago
They seem to have (if I understand correctly) degree-Celsius and degree-Fahrenheit symbols. So maybe Kelvin is included for consistency, and it just happens to look identical to Latin K?

IMO the confusing bit is giving it a lower case. It is a symbol that happens to look like an upper case, not an actual letter…

bigwheels•1mo ago
And why can't the symbol be a regular old uppercase "K"? Who is this helping?
bee_rider•1mo ago
I think just using uppercase Latin K is the recommendation.

But, I dunno. Why would anybody apply upper or lower case operators to a temperature measurement? It just seems like a nonsense thing to do.

zygentoma•1mo ago
Maybe not for text to be read again, but might be sensible e.g. for slug or file name generation and the like...
bee_rider•1mo ago
That’s an interesting thought.

IMO this is somewhere where if we were really doing something, we might as well go all the way and double check the relevant standards, right? The filesystem should accept some character set for use as names, and if we’re generating a name inside our program we should definitely find a character set that fits inside what the filesystem expects and that captures what we want to express… my gut says upper case Latin K would be the best pick if we needed to most portably represent Kelvin in a filename on a reasonably modern, consumer filesystem.

Eisenstein•1mo ago
I wonder if you can register a domain with it in the name.
infogulch•1mo ago
Unicode wants to be able to preserve round-trip re-encoding from this other standard which has separate letter-K and degree-K characters. Making these small sacrifices for compatibility is how Unicode became the defacto world standard.
shiomiru•1mo ago
The "other standard" in this case being IBM-944. (At least looking at https://www.unicode.org/versions/Unicode1.0.0/ch06.pdf p. 574 (=110 in the PDF) I only see a mapping from U+212A to that one.)
kahirsch•1mo ago
The ICU mappings files have entries for U212A in the following files:

    gb18030.ucm
    ibm-1364_P110-2007.ucm
    ibm-1390_P110-2003.ucm
    ibm-1399_P110-2003.ucm
    ibm-16684_P110-2003.ucm
    ibm-933_P110-1995.ucm
    ibm-949_P110-1999.ucm
    ibm-949_P11A-1999.ucm
ahoka•1mo ago
Probably useful in a non-latin codeset?
UltraSane•1mo ago
having a dedicated Kelvin symbol preserves the semantics.
oneshtein•1mo ago
A symbol may look differently than original letter, for example N - №, € - E (Є), S - $, integral, с - ©, TM - ™, a - @, and so on.

However, those symbols doesn't have lower case variants. Moreover, lower case k means kilo-, not a «smaller Kelvin».

bee_rider•1mo ago
Although it is a prefix in that case, so we should expect to see k alone.

To maximally confuse things, I suggest we start using little k alone to resolve another annoying unit issue: let’s call 1 kilocalorie “k.”

bawolff•1mo ago
> One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols?

Unicode has the goal of being a 1:1 mapping for all other character encodings. Usually weird things like this is so there can be a 1:1 reversible mapping to some ancient character encoding.

peterfirefly•1mo ago
> One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols?

To allow round-tripping.

Unicode did not win by being better than all previously existing encodings, even though it clearly was.

It won by being able to coexist with all those other encodings for years (decades) while the world gradually transitioned. That required the ability to take text in any of those older encodings and transcode it to Unicode and back again without loss (or "gain"!).

anonnon•1mo ago
> ICU has bindings for Rust that provide case-folding functionality, but not case-insensitive substring search.

> ICU has many bindings. The Rust one doesn’t expose any substring search functionality, but the Python one does:

Python's ICU support is based on ICU4C. Rust's ICU "bindings" are actually a new implementation called ICU4X, by developers who worked on i18n at Mozilla and Google and on ICU4C, with the goal of a cleaner, more performant implementation that is also memory safe. Maybe not relevant (as in substantially altering the benchmarks), but it's at least worth noting that the ICU backends aren't consistent throughout.

ashvardanian•1mo ago
Thanks a lot for the correction! I'll adjust the references in a bit.
bawolff•1mo ago
While this is definitely really cool, wouldn't people who need this type of speed usually just store the text to be searched already case folded?
user3939382•1mo ago
I just used AVX-512 today for a lisp processor very performant.