As the article demonstrates, the error manifests in a completely inscrutable way. But once I saw the bug from a couple of users with Turkish-sounding names, I zeroed in on it. And cursed a few times under my breath whoever messed up that character table so badly.
(I'm sure there's a good reason, but I find it odd that compiler message tags are invariably uppercase, yet this problem code lowercases them to look them up in an enum of lowercase names. Why isn't the enum uppercase, like the things you're going to look up?)
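A minimal sketch of the failure mode, assuming a lowercase tag enum like the one described (the enum and its constants here are hypothetical stand-ins):

```java
import java.util.Locale;

public class TagLookupBug {
    // Hypothetical stand-in for the enum of lowercase message tags.
    enum Tag { info, warning, usage }

    public static void main(String[] args) {
        Locale.setDefault(new Locale("tr", "TR")); // simulate a Turkish-locale machine

        String key = "INFO".toLowerCase(); // locale-sensitive: 'I' -> 'ı' (dotless i)
        System.out.println(key);           // prints "ınfo", not "info"

        Tag.valueOf(key);                  // throws IllegalArgumentException: no constant "ınfo"
        // The fix: "INFO".toLowerCase(Locale.ROOT)
    }
}
```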
Defaulting to ROOT makes a lot of sense for internal constants, like in the example in this article, but defaulting to ROOT for everything just exposes the problems that caused Sun to use the user locale by default in the first place.
It's like they decided that the uppercase of "a" is "E" and the uppercase of "e" is "A".
There is no reason to assume that the English representation is in general "correct", "standard", or even "first". The modern script for Turkish was adopted around the 1920s, so you could perhaps argue that most typewriters presented a standard that should have been followed. However, there was variation even between different typewriters, and I strongly suspect that typewriters weren't common in Turkey when the change was made.
It does in literally any language using a Latin alphabet other than Turkish.
Not quite. In English, 'i' and 'I' are two allographs of one grapheme, corresponding to many phonemes, based on context. (Using linguistic definitions here, not compsci ones.) The 'i's in 'kit' and 'kite' stand for different phonemes, for example.
> There is no reason to assume that the English representation is in general "correct", "standard", or even "first".
Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.
No one is saying Turkish cannot break from that convention - they can feel free to do anything they like - but the resulting issues are fairly predictable, and their adverse effects fall mainly on Turkish speakers in practice, not on the rest of us.
You're right; apologies, my linguistics is rusty and I was overconfident.
> Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.
I think my main argument is that the importance of standardizing on i/I was much less obvious in the 1920s. The benefits are obvious to us now, but I think we would be hard pressed to predict this outcome a priori.
I don't think it's fair to call it predictable. When this convention was chosen, the question of "what is the uppercase of i" was always bound to the context of a language. Now it suddenly isn't. It can't be helped. It wasn't even an explicit assumption that could be reflected upon; it was an implicit one that just happened.
Ö and ü were already borrowed from the German alphabet. The umlauts have a consistent effect on 'o' and 'u': they turn a back vowel into a front vowel. See: https://en.wikipedia.org/wiki/Vowel . Similarly, removing the dots brings them back.
Turkish already had the i sound and its back variant, a schwa-like sound: https://en.wikipedia.org/wiki/Close_back_unrounded_vowel . It has the same relation in the IPA as 'ö' has to 'o' and 'ü' has to 'u'. Since the makers of the Turkish variant of the Latin alphabet had the rare chance of making a regular spelling system for the state of the language, and since removing the dots had the effect of turning a front vowel into a back vowel, they simply copied this feature from ö and ü over to i:
Just remove the dots to make it a back vowel! Now we have ı.
When it comes to capitalization, ö becomes Ö and ü becomes Ü. So it is only logical to make the capital of i İ, and the lowercase of I ı.
Of course the Latin capital I is dotless, because originally the lowercase Latin "i" was also dotless. The dot was added later to make text more legible.
Also, we don't have serifs in our I. It's just a straight line. So, it's not even related to your Ii pair in English. You can't dictate how we write our straight lines, can you?
The root cause of the problem is in the implementation and standardization of computer systems. Computers were originally designed with only the English alphabet in mind, and were patched to support other languages over time, poorly. Computers should obey the rules of the language, not the other way around.
That depends on the font.
>So, it's not even related to your Ii pair in English.
Modern Turkish uses the Latin script, of course it's related.
>You can't dictate how we write our straight lines, can you?
No, I can't; I just want to understand why the Turks decided to change this letter, and this letter only, from the rest of the standard Latin script/diacritics.
The latinization reform of the Turkish language predates computers, and it was hard to foresee the woes that future generations would have with that choice.
Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.
When I shared computers with my parents I had to switch languages back and forth all the time. This helped me learn English rather quickly, but I find it a huge accessibility and software design issue.
If your program depends on letter cases, that is a badly designed program, period. If a language ships a toUpper or toLower function without a mandatory locale parameter, it is badly designed too. The only slightly better option is making toUpper and toLower ASCII-only and throwing an error for any other character set.
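Java does at least ship the explicit-locale overloads; a minimal sketch (class name mine) of why pinning the locale matters for machine-readable strings:

```java
import java.util.Locale;

public class ExplicitLocale {
    public static void main(String[] args) {
        // Same input, different answers depending on which locale you pass:
        System.out.println("i".toUpperCase(Locale.ROOT));                 // "I"
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr"))); // "İ"

        // Protocol-ish strings should always pin the locale explicitly:
        System.out.println("Content-Type".toLowerCase(Locale.ROOT));      // same on every machine
    }
}
```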
While half of the language design of C is questionable and outright dangerous, all the popular OSes making its functions locale-sensitive was an avoidable mistake. Yet everybody did it. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.
I don't care if Unicode releases a conversion map. Natural-language behavior should always require natural-language metadata too. Even modern languages like Rust did a crappy job of enforcing it: https://doc.rust-lang.org/std/primitive.char.html#method.to_... . Yes, it is significantly safer, but converting 'ß' to 'SS' in German definitely has gotchas too.
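The 'ß' gotcha is easy to demonstrate in Java, where the one-to-many mapping applies no matter how careful you are with locales:

```java
import java.util.Locale;

public class EszettGotcha {
    public static void main(String[] args) {
        String s = "straße";
        String upper = s.toUpperCase(Locale.GERMAN);

        System.out.println(upper);                            // "STRASSE"
        System.out.println(upper.toLowerCase(Locale.GERMAN)); // "strasse" -- round trip loses ß
        System.out.println(s.length() == upper.length());     // false: 6 vs. 7 chars
    }
}
```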
Isn't the choice of language and of date and unit formats normally independent?
> Isn't the choice of language and of date and unit formats normally independent?
You would hope so, but no. Quite a bit of software ties the language setting to the locale setting. If you are lucky, they will provide an "English (UK)" option (which still uses miles, and FFS, WTF is a stone!).
On Windows you can kinda select the units easily. On Linux, let me introduce you to the journey through the LC_* environment variables: https://www.baeldung.com/linux/locale-environment-variables . This doesn't mean the websites or the apps will obey them. Quite a few of them don't and just use LANGUAGE, LANG, or LC_CTYPE as their setting.
My company switched to Notion this year (I still miss Confluence). It was hell until last month, since they only had "English (US)" and used M/D/Y everywhere with no option to change!
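For what it's worth, Java itself has kept the two concerns separate since JDK 7 via Locale.Category, even if most apps never expose it; a small sketch:

```java
import java.util.Locale;

public class SplitLocale {
    public static void main(String[] args) {
        // UI language and number/date formats can be set independently:
        Locale.setDefault(Locale.Category.DISPLAY, Locale.forLanguageTag("en-US"));
        Locale.setDefault(Locale.Category.FORMAT,  Locale.forLanguageTag("en-GB"));

        System.out.println(Locale.getDefault(Locale.Category.DISPLAY)); // en_US
        System.out.println(Locale.getDefault(Locale.Category.FORMAT));  // en_GB
        // NumberFormat, DateTimeFormatter, etc. consult the FORMAT category.
    }
}
```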
POSIX requires that many functions account for the current locale. I'm not sure why you are blaming GNU for this.
Also, this is the last remaining major system-dependent default in Java. They made strict floating point the default in 17, and UTF-8 the default encoding in 18 (JEP 400); only the locale remains. I hope they make ROOT the default in an upcoming version.
FWIW, in the Scala.js implementation, we've been using UTF-8 and ROOT as the defaults forever.
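A quick way to see the remaining gap on any JDK 18+ machine, plus the defensive workaround a service can apply itself (a sketch, not an official recommendation):

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class RemainingDefault {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset()); // UTF-8 everywhere since JDK 18 (JEP 400)
        System.out.println(Locale.getDefault());      // still whatever the host OS says

        // Until ROOT becomes the default, pin it yourself at startup:
        Locale.setDefault(Locale.ROOT);
    }
}
```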
Unrelated, but a month ago I found a weird behaviour where, in a Kotlin scratch file, `List.isEmpty()` is always true. Questioned my sanity for at least an hour there... https://youtrack.jetbrains.com/issue/KTIJ-35551/