What to know about encodings and character sets to work with text (2011)

https://kunststube.net/encoding/

40•ColinWright•4mo ago

Comments

ColinWright•4mo ago

Full title:

"What every programmer absolutely, positively needs to know about encodings and character sets to work with text"

____tom____•4mo ago

> Because Unicode is not an encoding.

> Overall, Unicode is yet another encoding scheme.

Terr_•4mo ago

Yeah, author seems to have made a mistake there.

> Unicode is a large table mapping characters to numbers and the different UTF encodings specify how these numbers are encoded as bits. Overall, Unicode is yet another encoding scheme.

I would guess this represents a confusion between the narrow abstract definition of Unicode versus the way it is casually used as an umbrella term which includes stuff like Transformation Formats.

jibal•4mo ago

The author doesn't understand what a character is, despite the Unicode standard making it very clear that character != codepoint

btilly•4mo ago

That's just somewhat sloppy.

Unicode is not an encoding of text to bits. It is an encoding of text to numbers. There are a variety of encodings of text to bits based on how those numbers are to be encoded into bits.

Though technically Unicode isn't even quite that. For example "é" can be encoded as U+00E9 or as U+0065,U+0301. Going the other way, "水", U+6C34, is drawn differently in simplified Chinese, Japanese, and traditional Chinese. Unicode calls this, "language-sensitive glyph variation".

Which means that the correspondence between text and Unicode is many to many both ways. And then the Unicode can show up in bits and bytes again in multiple ways.
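A minimal Python sketch of both halves of that point, using only the standard library. It shows the two code point sequences for "é" mentioned above, and then one code point being turned into bytes in three different ways. The variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # "é" as the single code point U+00E9
decomposed = "e\u0301"   # "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings look the same when rendered but are different code point sequences.
same_code_points = composed == decomposed

# Normalizing both to NFC makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# One code point (U+6C34), several possible byte sequences.
water = "\u6c34"
encoded = {name: water.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

print(same_code_points, same_after_nfc)
for name, data in encoded.items():
    print(name, data.hex())
```

Comparing user-visible text therefore usually needs a normalization step first; which normalization form is appropriate depends on the application.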

ryandrake•4mo ago

Joel covered this[1] topic over 20 years ago (!!) and we still regularly see "senior" programmers who just casually think of text as a string and strings as text, and that's all there is to it. I still regularly see websites full of ????? and U+FFFD and apostrophes becoming â€™ everywhere.

1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
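The three symptoms listed there can be reproduced in a few lines of Python. This is only a sketch of the most common causes (decoding UTF-8 bytes as Windows-1252, decoding invalid bytes with replacement, and encoding to a charset that lacks the character); real sites may break in other ways.

```python
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe

# UTF-8 bytes read back as Windows-1252 produce the familiar three-character garbage.
utf8_bytes = apostrophe.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")

# As long as nothing else touched the text, the mistake is reversible.
repaired = garbled.encode("cp1252").decode("utf-8")

# A lone Latin-1 byte is not valid UTF-8; lenient decoding yields U+FFFD.
replacement = b"\xe9".decode("utf-8", errors="replace")

# Encoding to a charset without the character yields a question mark.
question_mark = "\u6c34".encode("ascii", errors="replace")

print(garbled, repaired, repr(replacement), question_mark)
```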

TacticalCoder•4mo ago

> Text is either encoded in UTF-8 or it's not. If it's not, it's encoded in ASCII, ISO-8859-1, UTF-16 or some other encoding.

Nitpicking but if it's encoded in ASCII, it's by definition a validly encoded UTF-8 file.
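This can be checked directly in Python: every 7-bit ASCII byte decodes to the same text under both codecs, while the reverse does not hold. A small sketch, with illustrative variable names:

```python
# All 128 ASCII byte values, decoded both ways.
ascii_bytes = bytes(range(128))
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
identical = as_ascii == as_utf8

# The reverse direction fails: UTF-8 text outside ASCII is not valid ASCII.
try:
    "\u00e9".encode("utf-8").decode("ascii")
    utf8_is_ascii = True
except UnicodeDecodeError:
    utf8_is_ascii = False

print(identical, utf8_is_ascii)
```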

jibal•4mo ago

This accurate comment was previously dead. Glad that it got resurrected.

nick49488171•4mo ago

Bitmaps. Anything outside of ASCII should be a bitmap.

bloomca•4mo ago

How would that work? How many bytes per character? How different fonts would work?


nick49488171•4mo ago

Sorry, misplaced humor.

Uehreka•4mo ago

This is the encodings equivalent of the “there should just be one timezone” take.

random3•4mo ago

The best things are those that get out of the way.

dang•4mo ago

What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011) - https://news.ycombinator.com/item?id=30384223 - Feb 2022 (58 comments)

What programmers need to know about encodings and charsets (2011) - https://news.ycombinator.com/item?id=24162499 - Aug 2020 (22 comments)

What to know about encodings and character sets - https://news.ycombinator.com/item?id=9788253 - June 2015 (30 comments)

What Every Programmer Needs To Know About Encodings And Character Sets - https://news.ycombinator.com/item?id=4771987 - Nov 2012 (5 comments)

jibal•4mo ago

> Everybody is aware of this at some level, but somehow this knowledge seems to suddenly disappear in a discussion about text, so let's get it out first: A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits.

This is wrong and it goes downhill from there. I don't want to take the time and effort to fisk it, but it's full of errors like mistaking characters for codepoints and saying things like "In other words, ASCII maps 1:1 unto UTF-8" -- a bizarre and wrong way to say what he said in the previous sentence: "All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII".

torstenvl•4mo ago

It isn't wrong. Computers, broadly speaking, can only store binary digits.

I'm not sure if you're thinking of the Mark II, or the term as meaning human arithmeticians, or what, but that seems pedantic to the point of sophistry.

jibal•4mo ago

I've pointed out your mistakes elsewhere and won't respond to you otherwise. I just want to alert you to the fact that, when you told someone several weeks ago that "Your behavior has no place here" you were addressing an HN public moderator.

jibal•4mo ago

"pedantic to the point of sophistry"

Gotta love how those who can't comprehend reach for the ad hominem. And it's so absurdly hypocritical ... the claim that "A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits" is extraordinarily pedantic sophistry and WRONG. It comes from people who have no understanding of the concepts of a representation and abstraction and either don't know how digital storage works or are pretending not to. The many bi-state mechanisms we use for digital storage are not bits, they represent bits. And CPUs don't contain (or "store") bits, they are made of transistors that control the flow of electrons ... modeling this as "bits" is an abstraction.

But hey, I guess John von Neumann was a pedant and a sophist when he talked about stored program computers rather than stored bit computers.

torstenvl•3mo ago

An ad hominem is a fallacious argument pertaining to one or more individual characteristics of one or more persons. Criticism of an argument as pedantic and/or sophistry cannot possibly be an ad hominem, because it is a criticism of the argument itself.

By contrast, your attempt to discredit me by reference to some other interaction we seem to have had is an ad hominem. More interestingly, your reference to John von Neumann and your reference to a moderator are both also a subclass of ad hominems known as "appeals to authority."

https://ethics.org.au/ethics-explainer-ad-hominem-fallacy/

fainpul•4mo ago

Highly related recommendation: https://i18n-puzzles.com/

It's a series of tasks ("puzzles") in the style of Advent of Code. Some deal with text handling, some with dates and times.

In my opinion it's a fun way to really get this stuff in your brain (by doing, not just reading about it) and especially learn about what your programming language of choice has to offer in this department.

I find the later puzzles have a bit of an artificial difficulty increase, which makes them seem a bit far fetched and unrealistic. But the first few are definitely reasonable and applicable to real-world scenarios. You also don't have to do them in order. Unlike with AoC, all the puzzles are available from the start.

geocar•4mo ago

> Say, your app must accept files uploaded in GB18030, but internally you are handling all data in UTF-32. A tool like iconv can cleanly convert the uploaded file with a one-liner like iconv('GB18030', 'UTF-32', $string). That is, it will preserve the characters while changing the underlying bits:

Oh for goodness sake please please don't do this: Despite the appearance of the "representations" given, GB18030 is bigger than Unicode so this potentially destroys information. Almost any other 'character (encoding) set' would have been a better example, but definitely not this one, and unless you already know why, it might work for a long time before you discover a problem.

Actually, I do not generally recommend converting anything ever; I try to save the original customer/user submission and then any derivative use of it that needs some specific conversion can use that. If you save the bytes you were given, you can fix problems like this when they come up, but if you normalise everything before saving your golden record in your database, you might actually lose something important.
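A minimal Python sketch of the "keep the original bytes, derive conversions on demand" approach described above. The `Submission` class and its field names are hypothetical, not from any library, and the example does not try to demonstrate the GB18030 concern itself; it only shows that the stored record stays untouched while conversions are computed from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical golden record: the bytes as received plus the declared encoding."""

    raw: bytes
    declared_encoding: str

    def text(self) -> str:
        # Derived view; raises UnicodeDecodeError if the declaration was wrong.
        return self.raw.decode(self.declared_encoding)

    def as_utf32(self) -> bytes:
        # Another derived view, never written back over the original bytes.
        return self.text().encode("utf-32-le")


upload = Submission(raw="\u6c34".encode("gb18030"), declared_encoding="gb18030")
print(upload.text(), upload.as_utf32().hex(), upload.raw.hex())
```

If a conversion later turns out to be lossy or wrong, the fix is a change to the derived method rather than a data recovery exercise.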

Three other things to know about "encoding and character sets" that I feel like are more important than code points:

1. If you don't know the language, you can't sort/compare, so if you think this saves you keeping track of the 'character set', well you _should_ have been tracking 'character set+language' anyway, so even if UTF32 worked, you'd still need the field for language anyway. And yeah, this affects "latin" languages too.

2. If you don't know the font, you can't figure out how big something is, draw it, wrap it, count the "characters", and so on. If you're beginning to wonder what you can do with text you can't read, you're starting to get the idea.

3. Microsoft is a massive fucking company and can't get RTL right. Bananas, right? You have no hope if you do not talk to actual human beings that use the language. This guy https://www.notarabic.com gave a talk a few years ago which I recommend if that sounds incredible.

tl;dr: text is hard, let's go to the beach.
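Points 1 and 2 can be illustrated in Python with the standard library alone. The sketch below only shows the problem: a plain sort orders by code point, not by any language's rules, and "how many characters" depends on which code point sequence was used. A real fix needs language-aware collation (for example via the `locale` module or an ICU binding), which is not shown here.

```python
# Sorting by code point puts the accented word after "zebra",
# which is not what a French or English reader would expect.
words = ["zebra", "\u00e9clair", "apple"]
by_code_point = sorted(words)

# The same visible letter can be one code point or two.
precomposed = "\u00e9"
decomposed = "e\u0301"
lengths = (len(precomposed), len(decomposed))

print(by_code_point, lengths)
```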

danhau•4mo ago

At my job I have to deal with an old system that invented its own encoding, named TSS. The idea was to unify multiple charsets and encodings into one, before Unicode was a thing. But instead of coming up with one big charset and assigning codepoints plus an encoding scheme, they thought it was wise to just repackage other encodings and charsets. Think Matroska, but for text. And yes, I do mean charsets AND encodings. Sometimes they repackage an encoding, sometimes just a charset where the codepoints are the encoding.

TSS supports the ISO-8859 charsets and corresponding (but deviating) Windows codepages, traditional and simplified Chinese, half- and fullwidth Japanese, Korean via Wansung and Johab, and others I'm forgetting right now. And in newer versions of the software, they also support Unicode, but using a custom encoding.

Thankfully a good chunk of all that is well documented, like the byte values introducing a fullwidth Japanese character, for example. But they don't describe what charset or encoding is actually used. EUC-JP? Shift-JIS? Turns out it's JIS X 0208. You'd think they would just use Shift-JIS, which gives them both full- and halfwidth Japanese in one shot, but no. They package those explicitly as JIS X 0208 and JIS X 0201. Similar questions arise for Chinese and the others. It took a lot of reverse engineering to figure that stuff out. But if you think that is hard, have fun finding tables to map those old encodings to Unicode and back. Java is a godsend in this case. Charset.availableCharsets has them all!
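To make the charset-versus-encoding distinction concrete, here is a small Python sketch using the standard codecs `shift_jis`, `euc_jp` and `iso2022_jp`. It has nothing to do with TSS itself, which is not publicly documented here; it only shows one JIS X 0208 character producing different bytes under each encoding, and Shift-JIS handling a halfwidth katakana in a single byte.

```python
fullwidth = "\u6c34"   # a kanji from JIS X 0208
halfwidth = "\uff71"   # HALFWIDTH KATAKANA LETTER A, from JIS X 0201

codecs_to_try = ("shift_jis", "euc_jp", "iso2022_jp")

# Same character, three different byte sequences.
encoded = {name: fullwidth.encode(name) for name in codecs_to_try}
decoded = {name: data.decode(name) for name, data in encoded.items()}

# Shift-JIS covers both widths: two bytes for the kanji, one for the halfwidth kana.
sjis_lengths = (len(fullwidth.encode("shift_jis")), len(halfwidth.encode("shift_jis")))

for name, data in encoded.items():
    print(name, data.hex())
print(sjis_lengths)
```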

What's kinda charming is that TSS also contains text formatting commands. "Make all following text bold! Make it underlined! Now make it both bold and underlined!" Stuff like that.

What's less charming is that TSS is actually a superset (an extension) of the ISO-8859 family, similar to how ISO-8859 is a superset of ASCII. In other words, all ISO-8859-1 (or any other variant) is perfectly valid TSS, but not all TSS is valid ISO-8859-1. This creates a lot of fun meetings with other departments when they query the database and are puzzled as to where those weird characters in their ISO-8859-1 text came from.
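One reason such data slips through unnoticed: decoding as ISO-8859-1 never fails, because every byte value maps to some code point. The Python sketch below shows this with the standard `iso-8859-1` codec. The trailing bytes in `record` are made up for illustration and are not real TSS sequences.

```python
# Every possible byte decodes under ISO-8859-1, to the code point of the same number.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in as_latin1]

# Hypothetical record: Latin-1 text followed by two bytes meant for another charset.
record = b"caf\xe9" + b"\x90\x85"
misread = record.decode("iso-8859-1")  # no error, just unexpected control characters

print(code_points == list(range(256)), repr(misread))
```

A consumer that assumes ISO-8859-1 gets no exception to warn it, only invisible or odd characters in the output.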
