The macOS LC_COLLATE hunt: Or why does sort order differently on macOS and Linux (2020)

https://blog.zhimingwang.org/macos-lc_collate-hunt

97•g0xA52A2A•3mo ago

Comments

OptionOfT•3mo ago

Updated link to the file as https://opensource.apple.com/source/adv_cmds/adv_cmds-118/us... doesn't work anymore: https://github.com/apple-oss-distributions/adv_cmds/blob/adv...

loeg•3mo ago

(2020)

skopje•3mo ago

So the ISO way is the right way, right?

dataflow•3mo ago

I wondered the same. What's the right ordering?

monerozcash•3mo ago

The right way is the one that you choose yourself and suits your needs.

There's no default right answer to this, as the answer depends entirely on what you're sorting and how you want it sorted. Even for a given character set the "correct" alphabetical sorting is still locale dependent.

And even knowing all that, "correct" programmatic sorting might still be essentially impossible. Some digraphs may be sorted differently depending on the specific word. For example A vs Aa, where Aa means Å. But Aa won't always necessarily mean Å, so good luck figuring that out.

asveikau•3mo ago

Sorting is language-specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?

Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
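
To make the code point problem concrete, here is a minimal Python sketch (the word list is made up for illustration). It only shows two things: a plain code point sort places ñ after every unaccented letter, and Python's default, locale-independent case mapping knows nothing about the Turkish dotted and dotless I.

```python
# Illustrative word list, not taken from the article.
words = ["oso", "ñame", "nube"]

# A plain sort compares code points: "ñ" is U+00F1, above "o" (U+006F),
# so it lands after "oso" instead of between "n" and "o" as in Spanish.
naive = sorted(words)
print(naive)  # ['nube', 'oso', 'ñame']

# The default case mapping is not Turkish-aware.
print("I".lower())       # 'i' (Turkish would want dotless 'ı')
print("ı".upper())       # 'I'
print(len("İ".lower()))  # 2: 'i' followed by U+0307 COMBINING DOT ABOVE
```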

dmurray•3mo ago

And that's why there are a hundred different possible values for LC_COLLATE, and it's completely normal that two popular Unix distributions picked different default values for that setting...right?

It would have been reasonable to conclude the article a third of the way through, and say "sorting is locale-dependent, if what you value is consistent behaviour between different OSs (instead of sorting based on the user's preferences) you need to implement the sorting yourself."

harrall•3mo ago

Setting LC_ALL=C gives you consistent sorting behavior.

The article does mention it but in passing.
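
For what it's worth, the C locale collates by raw byte value, which is why it behaves the same everywhere. A rough Python analogue, assuming UTF-8 encoded names (the names below are made up):

```python
# Hypothetical names, chosen to show where digits, case and "~" end up.
names = ["b", "A", "~last", "a", "B", "1"]

# Comparing the encoded bytes approximates what strcmp-style collation
# does in the C locale.
byte_order = sorted(names, key=lambda s: s.encode("utf-8"))
print(byte_order)  # ['1', 'A', 'B', 'a', 'b', '~last']
```

Digits come first, then uppercase, then lowercase, with "~" after all ASCII letters. It's consistent, but not what most people would call alphabetical.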

encom•3mo ago

Before the Danish language adopted the letter "å" (in 1948), the vowel was written as "aa". In the Danish alphabet, "å" is the last letter. Therefore a list of three Danish city names would be correctly sorted as:

  * Albertslund
  * Odense
  * Aarhus

This feels like material for another Tom Scott video.
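
A minimal Python sketch of that ordering, assuming every "aa" stands for "å" (which, as noted elsewhere in the thread, is not always true). `DANISH_ALPHABET` and `danish_key` are names made up for this example, not a library API.

```python
# Danish alphabet order: a-z followed by æ, ø, å.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

def danish_key(name):
    # Naive assumption: treat every "aa" as the old spelling of "å".
    folded = name.lower().replace("aa", "å")
    # Characters outside the alphabet sort after everything else.
    return [RANK.get(ch, len(RANK)) for ch in folded]

cities = ["Aarhus", "Odense", "Albertslund"]
print(sorted(cities))                  # ['Aarhus', 'Albertslund', 'Odense']
print(sorted(cities, key=danish_key))  # ['Albertslund', 'Odense', 'Aarhus']
```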

tpmoney•3mo ago

Not Tom Scott, but Dylan Beattie has done a handful of interesting talks[1] effectively on "there's no such thing as plain text" which in part covers this sort of thing. In fact, I think your Danish cities list is actually one of his examples.

[1]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo

encom•3mo ago

Finally had time to watch it, that was excellent. Thanks for the link.

Pike matchbox.

plufz•3mo ago

Haha. Like it wasn't enough with "tooghalvfems".

qw•3mo ago

And to make it more interesting, Sweden also has the letter "å", but it's in the 27th place in the alphabet (followed by "ä" and "ö"). In the Danish/Norwegian alphabet, the letter "å" is the last letter of the alphabet.

tracker1•3mo ago

Beyond that, it depends on what you are sorting and why: should File1.foo come before File005.foo or file020.foo? I've honestly thought about creating my own file manager just to sort files case-insensitively, with sequences of digits padded to the same length, and to fall back to case only when two names are otherwise identical, putting lowercase before uppercase at the first difference.

My worry is that it would perform badly on really large directories... That said, for where it's a pain, it would be helpful to say the least.
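
A minimal Python sketch of that idea, under my reading of the rules above. `natural_key` is a hypothetical helper name. Comparing digit runs as integers has the same effect as padding them to equal length.

```python
import re

def natural_key(name):
    # re.split with a capture group alternates text, digits, text, ...
    # so odd positions are always digit runs.
    parts = re.split(r"(\d+)", name)
    primary = tuple(
        int(part) if index % 2 else part.lower()
        for index, part in enumerate(parts)
    )
    # Tie-breaker: swapcase makes lowercase compare before uppercase.
    return (primary, name.swapcase())

files = ["file020.foo", "File1.foo", "File005.foo", "file1.foo"]
print(sorted(files, key=natural_key))
# ['file1.foo', 'File1.foo', 'File005.foo', 'file020.foo']
```

On the performance worry: `sorted` computes the key once per name rather than once per comparison, so the regex cost grows linearly with the directory size.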

1718627440•3mo ago

It isn't even language/nation dependent, there are also different official sorting orders in a single language dependent on the context, e.g. phone book vs. dictionary.
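
German is one commonly cited case. As far as I understand DIN 5007, the dictionary variant sorts "ä" like "a", while the phone book variant sorts it like "ae". A rough Python sketch under that assumption (the helper names and the name list are made up):

```python
# Assumed folding rules for the two variants; umlauts only, plus ß.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def dictionary_key(name):
    return name.lower().translate(DICTIONARY)

def phone_book_key(name):
    return name.lower().translate(PHONE_BOOK)

names = ["Mutter", "Muller", "Müller", "Mueller"]
print(sorted(names, key=dictionary_key))
# ['Mueller', 'Muller', 'Müller', 'Mutter']
print(sorted(names, key=phone_book_key))
# ['Müller', 'Mueller', 'Muller', 'Mutter']
```

Names that fold to the same key keep their input order here because Python's sort is stable. A real implementation would need an explicit tie-breaker.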

And then a lot of languages are used in different countries with different rules.

pjmlp•3mo ago

Yet another one of those POSIX and ISO things that most people don't bother to know about.

https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...

greesil•3mo ago

It's not a stable sort?

o11c•3mo ago

Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).

The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).

There are a lot of sparse arrays and UTF32 character data in compiled locales.

Incidentally, the command to dump a locale's data is:

  LC_ALL=whatever locale -ck `locale | sed 's/=.*//; /LANG\|LC_ALL/d'`

1a527dd5•3mo ago

Ask anyone who did a postgres upgrade. The words "collate" and "glibc" are enough to cause me to pause now. Learnt loads, never going to really use it again, but man do I understand the pain that causes now.

bluedino•3mo ago

Now I'm remembering all the fun we had a long time ago with PHP websites that used an AS/400 as a data source. The two didn't sort the same way, and the mom-and-pop web dev shop that was hired to create the website didn't understand the issue, tried to hack around it, and failed.

kenada•3mo ago

When I updated the Darwin SDK and source releases in nixpkgs last year, I tried using the FreeBSD locale data. It worked in a technical sense, but it broke things that depended on the quirks in Apple’s locale data. That statement about compatibility is unfortunately true.

kbd•3mo ago

In my Zsh startup on Mac I had to worry about collation, as I expected ~ to sort last (I have a directory prefixed with ~ to load plugins that need to be loaded last). Idk why a locale of utf-8 has it sorting differently, but I needed LC_COLLATE=C to have it sort as expected:

    # source all shell config
    export LC_COLLATE=C # ensure consistent sort, ~ at end
    for file in ~/bin/shell/**/*.(z|)sh; do
      source "$file";
    done
