As the article demonstrates, the error manifests in a completely inscrutable way. But once I saw the bug from a couple of users with Turkish-sounding names, I zeroed in on it. And cursed a few times under my breath whoever messed up that character table so badly.
(I'm sure there's a good reason, but I find it odd that compiler message tags are invariably uppercase, yet this problem code lowercases them to look them up in an enum of lowercase names. Why isn't the enum uppercase, like the things you're going to look up?)
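A minimal sketch of the failure mode, assuming a lowercase tag enum like the one described (the enum and its constants here are hypothetical stand-ins):

```java
import java.util.Locale;

public class TagLookupBug {
    // Hypothetical stand-in for the enum of lowercase message tags.
    enum Tag { info, warning, usage }

    public static void main(String[] args) {
        Locale.setDefault(new Locale("tr", "TR")); // simulate a Turkish-locale machine

        String key = "INFO".toLowerCase(); // locale-sensitive: 'I' -> 'ı' (dotless i)
        System.out.println(key);           // prints "ınfo", not "info"

        Tag.valueOf(key);                  // throws IllegalArgumentException: no constant "ınfo"
        // The fix: "INFO".toLowerCase(Locale.ROOT)
    }
}
```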
Defaulting to ROOT makes a lot of sense for internal constants, like in the example in this article, but defaulting to ROOT for everything just exposes the problems that caused Sun to use the user locale by default in the first place.
It's like they decided that the uppercase of "a" is "E" and the uppercase of "e" is "A".
There is no reason to assume that the English representation is in general "correct", "standard", or even "first". The modern script for Turkish was adopted around the 1920s, so you could perhaps argue that most typewriters presented a standard that should have been followed. However, there was variation even between different typewriters, and I strongly suspect that typewriters weren't common in Turkey when the change was made.
It does in literally any language using a Latin alphabet other than Turkish.
Not quite. In English, 'i' and 'I' are two allographs of one grapheme, corresponding to many phonemes, based on context. (Using linguistic definitions here, not compsci ones.) The 'i's in 'kit' and 'kite' stand for different phonemes, for example.
> There is no reason to assume that the English representation is in general "correct", "standard", or even "first".
Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.
No one is saying Turkish cannot break from that convention - they can feel free to do anything they like - but the resulting issues are fairly predictable, and their adverse effects fall mainly on Turkish speakers in practice, not on the rest of us.
You're right; apologies, my linguistics is rusty and I was overconfident.
> Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.
I think my main argument is that the importance of standardizing on i/I was much less obvious in the 1920s. The benefits are obvious to us now, but I think we would be hard pressed to predict this outcome a priori.
I don't think it's fair to call it predictable. When this convention was chosen, the question of "what is the uppercase of i" was always bound to the context of a language. Now it suddenly isn't. It can't be helped. It wasn't even an explicit assumption that could be reflected upon; it was an implicit one that just happened.
Ö and ü were already borrowed from the German alphabet. The umlauts have a consistent effect on 'o' and 'u': they turn a back vowel into a front vowel. See: https://en.wikipedia.org/wiki/Vowel . Similarly, removing the dots brings them back.
Turkish already had the i sound and its back variant, a schwa-like sound: https://en.wikipedia.org/wiki/Close_back_unrounded_vowel . It has the same relation in the IPA as 'ö' has to 'o' and 'ü' has to 'u'. Since the makers of the Turkish variant of the Latin alphabet had the rare chance of making a regular spelling system for the state of the language, and since removing the dots had the effect of turning a front vowel into a back vowel, they simply copied this feature from ö and ü over to i:
Just remove the dots to make it a back vowel! Now we have ı.
When it comes to capitalization, ö becomes Ö and ü becomes Ü. So it is only logical to make the capital of i İ, and the lowercase of I ı.
Of course the Latin capital I is dotless, because originally the lowercase Latin "i" was also dotless. The dot was added later to make text more legible.
Also, we don't have serifs in our I. It's just a straight line. So, it's not even related to your Ii pair in English. You can't dictate how we write our straight lines, can you?
The root cause of the problem is in the implementation and standardization of computer systems. Computers were originally designed with only the English alphabet in mind, and were patched to support other languages over time, poorly. Computers should obey the rules of the language, not the other way around.
That depends on the font.
>So, it's not even related to your Ii pair in English.
Modern Turkish uses the Latin script, of course it's related.
>You can't dictate how we write our straight lines, can you?
No, I can't; I just want to understand why the Turks decided to change this letter, and this letter only, from the rest of the standard Latin script/diacritics.
The latinization reform of the Turkish language predates computers, and it was hard to foresee the woes that future generations would have with that choice.
Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.
When I shared computers with my parents I had to switch languages back and forth all the time. This helped me learn English rather quickly, but I find it a huge accessibility and software design issue.
If your program depends on letter cases, that is a badly designed program, period. If a language ships a toUpper or toLower function without a mandatory locale parameter, it is badly designed too. The only slightly better option is making toUpper and toLower ASCII-only and throwing an error for any other character set.
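Java does at least ship the explicit-locale overloads; a minimal sketch (class name mine) of why pinning the locale matters for machine-readable strings:

```java
import java.util.Locale;

public class ExplicitLocale {
    public static void main(String[] args) {
        // Same input, different answers depending on which locale you pass:
        System.out.println("i".toUpperCase(Locale.ROOT));                 // "I"
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr"))); // "İ"

        // Protocol-ish strings should always pin the locale explicitly:
        System.out.println("Content-Type".toLowerCase(Locale.ROOT));      // same on every machine
    }
}
```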
While half of the language design of C is questionable and outright dangerous, all the popular OSes making its functions locale-sensitive was an avoidable mistake. Yet everybody did it. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.
I don't care if Unicode releases a conversion map. Natural-language behavior should always require natural-language metadata too. Even modern languages like Rust did a crappy job of enforcing it: https://doc.rust-lang.org/std/primitive.char.html#method.to_... . Yes, it is significantly safer, but converting 'ß' to 'SS' in German definitely has gotchas too.
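The 'ß' gotcha is easy to demonstrate in Java, where the one-to-many mapping applies no matter how careful you are with locales:

```java
import java.util.Locale;

public class EszettGotcha {
    public static void main(String[] args) {
        String s = "straße";
        String upper = s.toUpperCase(Locale.GERMAN);

        System.out.println(upper);                            // "STRASSE"
        System.out.println(upper.toLowerCase(Locale.GERMAN)); // "strasse" -- round trip loses ß
        System.out.println(s.length() == upper.length());     // false: 6 vs. 7 chars
    }
}
```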
Isn't the choice of language and of date and unit formats normally independent?
> Isn't the choice of language and of date and unit formats normally independent?
You would hope so, but no. Quite a bit of software ties the language setting to the locale setting. If you are lucky, they will provide an "English (UK)" option (which still uses miles, and FFS, WTF is a stone!).
On Windows you can kinda select the units easily. On Linux, let me introduce you to the journey through the LC_* environment variables: https://www.baeldung.com/linux/locale-environment-variables . This doesn't mean the websites or the apps will obey them. Quite a few of them don't and just use LANGUAGE, LANG, or LC_CTYPE as their setting.
My company switched to Notion this year (I still miss Confluence). It was hell until last month, since they only had "English (US)" and used M/D/Y everywhere with no option to change!
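For what it's worth, Java itself has kept the two concerns separate since JDK 7 via Locale.Category, even if most apps never expose it; a small sketch:

```java
import java.util.Locale;

public class SplitLocale {
    public static void main(String[] args) {
        // UI language and number/date formats can be set independently:
        Locale.setDefault(Locale.Category.DISPLAY, Locale.forLanguageTag("en-US"));
        Locale.setDefault(Locale.Category.FORMAT,  Locale.forLanguageTag("en-GB"));

        System.out.println(Locale.getDefault(Locale.Category.DISPLAY)); // en_US
        System.out.println(Locale.getDefault(Locale.Category.FORMAT));  // en_GB
        // NumberFormat, DateTimeFormatter, etc. consult the FORMAT category.
    }
}
```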
POSIX requires that many functions account for the current locale. I'm not sure why you are blaming GNU for this.
Also, this is the last remaining major system-dependent default in Java. They made strict floating point the default in 17, and UTF-8 the default encoding in 18 (JEP 400); only the locale remains. I hope they make ROOT the default in an upcoming version.
FWIW, in the Scala.js implementation, we've been using UTF-8 and ROOT as the defaults forever.
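A quick way to see the remaining gap on any JDK 18+ machine, plus the defensive workaround a service can apply itself (a sketch, not an official recommendation):

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class RemainingDefault {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset()); // UTF-8 everywhere since JDK 18 (JEP 400)
        System.out.println(Locale.getDefault());      // still whatever the host OS says

        // Until ROOT becomes the default, pin it yourself at startup:
        Locale.setDefault(Locale.ROOT);
    }
}
```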
Unrelated, but a month ago I found a weird behaviour where, in a Kotlin scratch file, `List.isEmpty()` is always true. Questioned my sanity for at least an hour there... https://youtrack.jetbrains.com/issue/KTIJ-35551/