UTF-8 is a brilliant design

https://iamvishnu.com/posts/utf8-is-brilliant-design

232•vishnuharidas•3h ago

Comments

happytoexplain•2h ago

I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

mort96•2h ago

Yeah I honestly don't know what I would change. Maybe replace some of the control characters with more common characters to save a tiny bit of space, if we were to go completely wild and break Unicode backward compatibility too. As a generic multi byte character encoding format, it seems completely optimal even in isolation.

amluto•2h ago

> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1.

I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.

1oooqooq•2h ago

the limitation tomorrow will be today's implementations, sadly.

throw0101d•1h ago

> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1z

Yes, it is 'truncated' to the "UTF-16 accessible range":

* https://datatracker.ietf.org/doc/html/rfc3629#section-3

* https://en.wikipedia.org/wiki/UTF-8#History

Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:

* https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

mort96•1h ago

That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8 that made the sacrifice: Unicode is limited to 21 bits due to UTF-16. The UTF-8 design trivially extends to support 6 byte long sequences supporting up to 31-bit numbers. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?

MyOutfitIsVague•1h ago

In an ideal future (read: fantasy), utf-16 gets formally deprecated and trashed, freeing the surrogate sequences and full range for utf-8.

Or utf-16 is officially considered a second class citizen, and some code points are simply out of its reach.

Analemma_•1h ago

It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Area, which won't expand. And that's after getting nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and things like emojis rarely only get a handful of new code points per update, because most new emojis are existing code points with combining characters).

If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.

layer8•20m ago

That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or thousand years. Until then, we still have a lot of unassigned code points.

In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.

procaryote•1h ago

> I love when an entity in a position of power is willing to break things in the name of advancement.

It's less fun when you have things that need to keep working break because someone felt like renaming a parameter, or that a part of the standard library looks "untidy"

happytoexplain•40m ago

I agree! And yet I lovingly sacrifice my man-hours to it when I decide to bump that major version number in my dependency manifest.

3pt14159•2h ago

I remember a time before UTF-8's ubiquity. It was such a headache moving to i18z. I love UTF-8.

glxxyz•2h ago

I worked on an email client. Many many character set headaches.

linguae•2h ago

I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

layer8•17m ago

On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...

bruce511•2h ago

While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.

In other words, yes it's backward compatible, but utf-is also compact and elegant even without that.

nextaccountic•2h ago

UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)

https://github.com/ParkMyCar/compact_str

How cool is that

(Discussed here https://news.ycombinator.com/item?id=41339224)

adgjlsfhk1•58m ago

How is that UTF8 specific?

quectophoton•2h ago

Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.

[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4

1oooqooq•2h ago

so you replace one costly sweeping with a costly sweeping. i wouldn't call that an advantage in any way over junping n bytes.

what you describe is the bare minimum so you even know what you are searching for while you scan pretty much everything every time.

hk__2•1h ago

What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.

gertop•1h ago

UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

UTF-8 didn't win on technical merits, it won becausw it was mostly backwards compatible with all American software that previously used ASCII only.

When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

ISV_Damocles•1h ago

UTF-16 is also just as complicated as UTF-8 requiring multibyte characters to cover the entirety of Unicode, so it doesn't avoid the issue you're complaining about for the newest languages added, and it has the added complexity of a BOM being required to be sure you have the pairs of bytes in the right order, so you are more vulnerable to truncated data being unrecoverable versus UTF-8.

UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.

adgjlsfhk1•34m ago

python does (although it will use 8 or 16 bits per character if all characters in the string fit)

adgjlsfhk1•1h ago

With BOM issues, UTF-16 is way more complicated. For Chinese and Japenese, UTF8 is a maximum of 50% bigger, but can actually end up smaller if used within standard file formats like JSON/HTML since all the formatting characters and spaces are single bytes.

syncsynchalt•26m ago

UTF-16 has endian concerns and surrogates.

Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.

cyphar•15m ago

UTF-16 is absolutely not easier to work with. The vast majority of bugs I remember having to fix that were directly related to encoding were related to surrogate pairs. I suspect most programs do not handle them correctly because they come up so rarely but the bugs you see are always awful. UTF-8 doesn't have this problem and I think that's enough reason to avoid UTF-16 (though "good enough" compatibility with programs that only understand 8-bit-clean ASCII is an even better practical reason). Byte ordering is also a pernicious problem (with failure modes like "all of my documents are garbled") that UTF-8 also completely avoids.

It is more compact for most (but not all) CJK characters, but that's not the case for all non-English characters. However, one important thing to remember is that most computer-based documents contain large amounts of ASCII text purely because the formats themselves use English text and ASCII punctuation. I suspect that most UTF-8 files with CJK contents are much smaller than UTF-16 files, but I'd be interested in an actual analysis from different file formats.

The size argument (along with a lot of understandable contention around UniHan) is one of the reasons why UTF-8 adoption was slower in Japan and Shift-JIS is not completely dead (though mainly for esoteric historical reasons like the 漢検 test rather than active or intentional usage) but this is quite old history at this point. UTF-8 now makes up 99% of web pages.

PaulHoule•1h ago

It's not uncommon when you want variable length encodings to write the number of extension bytes used in unary encoding

https://en.wikipedia.org/wiki/Unary_numeral_system

and also use whatever bits are left over encoding the length (which could be in 8 bit blocks so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic

https://archive.org/details/managinggigabyte0000witt

together with other methods that let you compress a text + a full text index for the text into less room than text and not even have to use a stopword list. As you say, UTF-8 does something similar in spirit but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.

Animats•1h ago

Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.

Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.

I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.

btown•1h ago

This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!

In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.

nostrademons•1h ago

PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where it comes from a literal or bytes.decode() call. Cut memory usage in typical Django applications by 2/3 when it was implemented.

https://peps.python.org/pep-0393/

I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use-cases involve chopping off a small prefix (eg. "hex_digits[2:]") or suffix (eg. "filename[-3:]"), and you can easily just linear search these with minimal CPU penalty. Or they're part of library methods where you want to have your own custom traversals, eg. .find(substr) can just do Boyer-Moore over bytes, .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.

deepsun•1h ago

That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.

Quick googling (not all of them are on-topic tho):

https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...

https://www.cve.org/CVERecord/SearchResults?query=utf-8

s1mplicissimus•22m ago

I was just wondering a similar thing: If 10 implies start of character, doesn't that require 10 to never occur inside the other bits of a character?

thesz•1h ago

> so you can easily find the beginning of the next or previous character.

It is not true [1]. While it is not UTF-8 problem per se, it is a problem of how UTF-8 is being used.

[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...

layer8•1h ago

Parent means “character” as defined here in Unicode: https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha..., effectively code points. Meanings 2 and 3 in the Unicode glossary here: https://www.unicode.org/glossary/#character

procaryote•1h ago

also, the redundancy means that you get a pretty good heuristic for "is this utf-8". Random data or other encodings are pretty unlikely to also be valid utf-8, at least for non-tiny strings

spankalee•7m ago

Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.

I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you slice naively slice an array of UTF-8 bytes, a UTf-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.

jancsika•5m ago

> Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

Given four byte maximum, it's a similarly trivial algo for the other case you mention.

The main difference I see is that UTF8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention the continuation bytes would cause silent errors (since an ASCII character would be indecipherable from continuation bytes).

Encoding gurus-- am I right?

twbarr•2h ago

It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.

hu3•1h ago

I wonder if that placemat still exists today. It would be such an important piece of computer history.

ot•7m ago

> It was so easy once we saw it that there was no reason to keep the placemat for notes, and we left it behind. Or maybe we did bring it back to the lab; I'm not sure. But it's gone now.

https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...

cyberax•2h ago

UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).

The only problem with UTF-8 is that Windows and Java were developed without knowledge about UTF-8 and ended up with 16-bit characters.

Oh yes, and Python 3 should have known better when it went through the string-bytes split.

wrs•2h ago

UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.

As Unicode (quickly) evolved, it turned out not that only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.

So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.

UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.

rowls66•1h ago

I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to unicode is going to be seen a big mistake. We already have enough network bandwith to just send raster graphics for images in most cases. Cluttering the unicode codespace with emojis is pointless.

dudeinjapan•1h ago

CJK unification (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) i.e. combining "almost same" Chinese/Japanese/Korean characters into the same codepoint, was done for this reason, and we are now living with the consequence that we need to load separate Traditional/Simplified Chinese, Japanese, and Korean fonts to render each language. Total PITA for apps that are multi-lingual.

mort96•1h ago

This feels like it should be solveable with introducing a few more marker characters, like one code point representing "the following text is traditional Chinese", "the following text is Japanese", etc? It would add even more statefulness to Unicode, but I feel like that ship has already sailed with the U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE characters...

fanf2•54m ago

Unicode used to have a system of in-band language tags, but it was deprecated https://www.unicode.org/faq//languagetagging.html

mort96•1h ago

The silly thing is, lots of emoji these days aren't even a single code point. So many emoji these days are two other code points combined with a zero width joiner. Surely we could've introduced one code point which says "the next code point represents an emoji from a separate emoji set"?

duskwuff•1h ago

Your understanding is incorrect; a substantial number of the ranges allocated outside BMP (i.e. above U+FFFF) are used for CJK ideographs which are uncommon, but still in use, particularly in names and/or historical texts.

gred•1h ago

> My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis

This week's Unicode 17 announcement [1] mentions that of the ~160k existing codepoints, over 100k are CJK codepoints, so I don't think this can be true...

[1] https://blog.unicode.org/2025/09/unicode-170-release-announc...

daneel_w•1h ago

I entirely agree that we could've cared better for the leading 16 bit space. But adding a second component (images) to the concept of textual strings would've been a terrible choice.

The grande crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification, where we already had a whooping 1.1 million code points at our disposal.

jasonwatkinspdx•9m ago

You are mistaken. Chinese Hanzi and the languages that derive from or incorporate them require way more than 65,536 code points. In particular a lot of these characters are formal family or place names. USC-2 failed because it couldn't represent these, and people using these languages justifiably objected to having to change how their family name is written to suit computers, vs computers handling it properly.

This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.

wongarsu•2h ago

Yeah, Java and Windows NT3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added unicode back when Unicode was 16 bit and the need for something like UTF-8 was less clear

hyperman1•2h ago

One thing I always wonder: It is possible to encode a unicode codepoint with too much bytes. UTF-8 forbids these, only the shortest one is valid. E.g 00000001 is the same as 11000000 10000001.

So why not make the alternatives impossible by adding the start of the last valid option? So 11000000 10000001 would give codepoint 128+1 as values 0 to 127 are already covered by a 1 byte sequence.

The advantages are clear: No illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? The required addition being an unacceptable hardware cost at the time?

UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.

rhet0rica•2h ago

See quectophoton's comment—the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser is jumping in at a random offset—or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early 90s, as transmission was much less reliable than it is today.

umanwizard•2h ago

Because then it would be impossible to tell from looking at a byte whether it is the beginning of a character or not, which is a useful property of UTF-8.

rightbyte•2h ago

I think that would garble random access?

variadix•2h ago

https://en.m.wikipedia.org/wiki/Self-synchronizing_code

nostrademons•2h ago

I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in]. Without that property, UTF-8 loses self-synchronicity, the property that given a truncated UTF-8 stream, you can always find the codepoint boundaries, and will lose at most codepoint worth rather than having the whole stream be garbled.

In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you declared that you had to subtract the legal codepoints represented by shorter sequences, you'd have to introduce additional arithmetic operations in encoding and decoding.

toast0•1h ago

The siblings so far talk about the synchronizing nature of the indicators, but that's not relevant to your question. Your question is more of

Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?

I suspect the answer is

a) the security impacts of overlong encodings were not contemplated; lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings

b) utf-8 as standardized allows for encode and decode with bitmask and bitshift only. Your proposed encoding requires bitmask and bitshift, in addition to addition and subtraction

You can find a bit of email discussion from 1992 here [1] ... at the very bottom there's some notes about what became utf-8:

> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.

The included FSS-UTF that's right before the note does include additive constants.

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

hyperman1•1h ago

Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.

I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security was not that big of a thing then as it is now, so you're probalbly right

layer8•31m ago

A variation of a) is comparing strings as UTF-8 byte sequences if overlong encodings are also accepted (before and/or later). This leads to situations where strings tested as unequal are actually equal in terms of code points.

rmccue•2h ago

Love the UTF-8 playground that's linked: https://utf8-playground.netlify.app/

Would be great if it was possible to enter codepoints directly; you can do it via the URL (`/F8FF` eg), but not in the UI. (Edit, the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)

vishnuharidas•1h ago

Thanks for the contribution, this is now merged and live.

alberth•2h ago

I’ve re-read so many times Joel’s article on Unicode. It’s also very helpful.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

burtekd•2h ago

I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4

quotemstr•2h ago

Great example of a technology you get from a brilliant guy with a vision and that you'll never get out of a committee.

librasteve•1h ago

some insightful unicode regex examples...

https://dev.to/bbkr/utf-8-internal-design-5c8b

nostrademons•1h ago

For more on UTF-8's design, see Russ Cox's one-pager on it:

https://research.swtch.com/utf8

And Rob Pike's description of the history of how it was designed:

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

twoodfin•1h ago

UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.

Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.

colejohnson66•1h ago

The idea was that the free bit would be repurposed, likely for parity.

EGreg•1h ago

I would love to think this is true, and it makes sense, but do you have any actual evidence for this you could share with HN?

KPGv2•1h ago

This is not true. ASCII (technically US-ASCII) was a fixed-width encoding of 7 bits. There was no 8th bit reserved. You can read the original standard yourself here: https://ia600401.us.archive.org/23/items/enf-ascii-1968-1970...

Crucially, "the 7-bit coded character set" is described on page 6 using only seven total bits (1-indexed, so don't get confused when you see b7 in the chart!).

There is an encoding mechanism to use 8 bits, but it's for storage on a type of magnetic tape, and even that still is silent on the 8th bit being repurposed. It's likely, given the lack of discussion about it, that it was for ergonomic or technical purposes related to the medium (8 is a power of 2) rather than for future extensibility.

kbolino•1h ago

Notably, it is mentioned that the 7-bit code is developed "in anticipation of" ISO requesting such a code, and we see in the addenda attached at the end of the document that ISO began to develop 8-bit codes extending the base 7-bit code shortly after it was published.

So, it seems that ASCII was kept to 7 bits primarily so "extended ASCII" sets could exist, with additional characters for various purposes (such as other languages, but also for things like mathematical symbols).

layer8•39m ago

When ASCII was invented, 36-bit computers were popular, which would fit five ASCII characters with just one unused bit per 36-bit word. Before, 6-bit character codes were used, where a 36-bit word could fit six of them.

mort96•1h ago

I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode octets with 7-bit bytes). A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.

In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...

[1] https://en.wikipedia.org/wiki/Quoted-printable

[2] https://en.wikipedia.org/wiki/8-bit_clean

ajross•49m ago

"Five characters in a 36 bit word" was a fairly common trick on pre-byte architectures too.

bear8642•13m ago

5 characters?

I thought it was normally six 6bit characters?

spydum•1h ago

https://www.sensitiveresearch.com/Archive/CharCodeHist/X3.4-...

Looks to me like serendipity - they thought 8 bits would be wasteful, they didnt have a need for that many characters.

KPGv2•1h ago

Historical luck. Though "luck" is probably pushing it in the way one might say certain math proofs are historically "lucky" based on previous work. It's more an almost natural consequence.

Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just like technically there are a number of ASCII variants, with the common just referred to as ASCII these days).

BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64, and for capital letters + numbers, you have 36, plus a few common punctuation marks puts you around 50. IIRC the original by IBM was around 45 or something. Slash, period, comma, tc.

So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't print anything but something less than 128 characters anyway. There wasn't any ó or ö or anything printable, so why support it?

But eventually that yielded to 8-bit encodings (various ASCIIs like latin-1 extended, etc. that had ñ etc.).

Crucially, UTF-8 is only compatible with the 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.

michaelsshaw•1h ago

I'm not sure, but it does seem like a great bit of historical foresight. It stands as a lesson to anyone standardizing something: wanna use a 32 bit integer? Make it 31 bits. Just in case. Obviously, this isn't always applicable (e.g. sizes, etc..), but the idea of leaving even the smallest amount of space for future extensibility is crucial.

jasonwatkinspdx•1h ago

Not an expert but I happened to read about some of the history of this a while back.

ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.

Morse code is variable length, so this made automatic telegraph machines or teletypes awkward to implement. The solution was the 5 bit Baudot code. Using a fixed length code simplified the devices. Operators could type Baudot code using one hand on a 5 key keyboard. Part of the code's design was to minimize operator fatigue.

Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.

Anyhow, the next change came with instead of telegraph machines directly signaling on the wire, instead a typewriter was used to create a punched tape of codepoints, which would be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where stuff like "Carriage Return" and "Line Feed" originate. This got standardized by Western Union and internationally.

By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.

So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.

pcthrowaway•8m ago

> But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.

Technically, the punch card processing technology was patented by inventor Herman Hollerith in 1884, and the company he founded wouldn't become IBM until 40 years later (though it was folded with 3 other companies into the Computing-Tabulating-Recording company in 1911, which would then become IBM in 1924).

To be honest though, I'm not clear how ASCII came from anything used by the punch card sorting machines, since it wasn't proposed until 1961 (by an IBM engineer, but 32 years after Hollerith's death). Do you know where I can read more about the progression here?

toast0•1h ago

7 bits isn't that odd. Bauddot was 5 bits, and found insufficient, so 6 bit codes were developed; they were found insufficient, so 7-bit ASCII was developed.

IBM had standardized 8-bit bytes on their System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7-bits was weird, but characters didn't necessarily fit nicely into system words anyway.

layer8•51m ago

The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well.

The accident of history is less that ASCII happens to be 7 bits, but that the relevant phase of computer development happened to primarily occur in an English-speaking country, and that English text happens to be well representable with 7-bit units.

sheerun•1h ago

I'll mention IPv6 as bad design that could have been potentially UTF-8-like success story

tialaramex•1h ago

No. UTF-8 is for encoding text, so we don't need to care about it being variable length because text was already variable length.

The network addresses aren't variable length, so if you decide "Oh IPv6 is variable length" then you're just making it worse with no meaningful benefit.

The IPv4 address is 32 bits, the IPv6 address is 128 bits. You could go 64 but it's much less clear how to efficiently partition this and not regret whatever choices you do make in the foreseeable future. The extra space meant IPv6 didn't ever have those regrets.

It suits a certain kind of person to always pay $10M to avoid the one-time $50M upgrade cost. They can do this over a dozen jobs in twenty years, spending $200M to avoid $50M cost and be proud of saving money.

zamalek•1h ago

Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode utf-8, where-as the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.

postalrat•1h ago

Looks similar to midi

dotslashmain•1h ago

Rob Pike and Ken Thompson are brilliant computer scientists & engineers.

dpc_01234•1h ago

UTF-8 is a undeniably a good answer, but to a relatively simple bit twiddling / variable len integer encoding problem in a somewhat specific context.

I realize that hindsight is 20/20, and time were different, but lets face it: "how to use an unused top bit to best encode larger number representing Unicode" is not that much of challenge, and the space of practical solutions isn't even all that large.

ivanjermakov•1h ago

I just realised that all latin text is wasting 12% of storage/memory/bandwidth with MSB zero. At least is compresses well. Are there any technology that utilizes 8th bit for something useful, e.g. error checking?

tmiku•14m ago

See mort96's comments about 7-bit ASCII and parity bits (https://news.ycombinator.com/item?id=45225911). Kind of archaic now, though - 8-bit bytes with the error checking living elsewhere in the stack seems to be preferred.

Tuna-Fish•1h ago

Except that there were many different solutions before UTF-8, all of which sucked really badly.

UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.

carlos256•1h ago

No, it's not. It's just a form of Elias-Gamma coding.

carlos256•1h ago

* unary encoding coding.

ummonk•1h ago

> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.

ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.

Dwedit•51m ago

Meanwhile Shift-JIS has a bad design, since the second byte of a character can be any ASCII character 0x40-0x9E. This includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde. This can cause a path separator or math operator to appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.

UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.

tiahura•41m ago

How many llm tokens are wasted everyday resolving utf issues?

vintermann•36m ago

UTF-8 is as good as a design as could be expected, but Unicode has scope creep issues. What should be in Unicode?

Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because

* It's not discrete. Some code points are for combining with other code points.

* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same, have different code points and meanings.

* It's not all printable. Control characters are in there - they pretty much had to be due to compatibility with ASCII, but they've added plenty of their own.

I'm not aware of any Unicode code points that are animated - at least what's printable, is printable on paper and not just on screen, there are no marquee or blink control characters, thank God. But, who knows when that invariant will fall too.

By the way, I know of one utf encoding the author didn't mention, utf-7. Like utf-8, but assuming that the last bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in utf-7 once, that's how I know what it is. I don't know how he managed to send it, though.

anthonyiscoding•17m ago

UTF-8 contributors are some of our modern day unsung heroes. The design is brilliant but the dedication to encode every single way humans communicate via text into a single standard, and succeed at it, is truly on another level.

Most other standards just do the xkcd thing: "now there's 15 competing standards"

sawyna•8m ago

I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?

UTF-8 is a brilliant design

QGIS is a free, open-source, cross platform geographical information system

Many hard LeetCode problems are easy constraint problems

EU court rules nuclear energy is clean energy

The treasury is expanding the Patriot Act to attack Bitcoin self custody

Rust: A quest for performant, reliable software [video]

3D modeling with paper

Corporations are trying to hide job openings from US citizens

First 'perovskite camera' can see inside the human body

How FOSS Projects Handle Legal Takedown Requests

Humanely dealing with humungus crawlers

Qwen3-Next

Vector database that can index 1B vectors in 48M

How to Become a Pure Mathematician (Or Statistician)

Windows-Use: an AI agent that interacts with Windows at GUI layer

Polylaminin, a drug considered capable of reversing spinal cord injury

Oq: Terminal OpenAPI Spec Viewer

Why do browsers throttle JavaScript timers?

I made a small site to share text and files for free, no ads, no registration

Building a Deep Research Agent Using MCP-Agent

Advanced Scheme Techniques (2004) [pdf]

OpenAI Grove

VaultGemma: The most capable differentially private LLM

K2-think: A parameter-efficient reasoning system

Show HN: 47jobs – A Fiverr/Upwork for AI Agents

Doom-ada: Doom Emacs Ada language module with syntax, LSP and Alire support

Racintosh Plus – Rackmount Mac Plus

Wysiwid: What you see is what it does

I don't like curved displays

Power series, power serious (1999) [pdf]

UTF-8 is a brilliant design

QGIS is a free, open-source, cross platform geographical information system

Many hard LeetCode problems are easy constraint problems

EU court rules nuclear energy is clean energy

The treasury is expanding the Patriot Act to attack Bitcoin self custody

Rust: A quest for performant, reliable software [video]

3D modeling with paper

Corporations are trying to hide job openings from US citizens

First 'perovskite camera' can see inside the human body

How FOSS Projects Handle Legal Takedown Requests

Humanely dealing with humungus crawlers

Qwen3-Next

Vector database that can index 1B vectors in 48M

How to Become a Pure Mathematician (Or Statistician)

Windows-Use: an AI agent that interacts with Windows at GUI layer

Polylaminin, a drug considered capable of reversing spinal cord injury

Oq: Terminal OpenAPI Spec Viewer

Why do browsers throttle JavaScript timers?

I made a small site to share text and files for free, no ads, no registration

Building a Deep Research Agent Using MCP-Agent

Advanced Scheme Techniques (2004) [pdf]

OpenAI Grove

VaultGemma: The most capable differentially private LLM

K2-think: A parameter-efficient reasoning system

Show HN: 47jobs – A Fiverr/Upwork for AI Agents

Doom-ada: Doom Emacs Ada language module with syntax, LSP and Alire support

Racintosh Plus – Rackmount Mac Plus

Wysiwid: What you see is what it does

I don't like curved displays

Power series, power serious (1999) [pdf]

UTF-8 is a brilliant design

Comments