It’s also not clear to me that the code point is a good abstraction in the design of UTF8. Usually, what you want is either the byte or the grapheme cluster.
Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think it's the most questionable design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been handled with a dedicated 32-bit codepoint...)
For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, they belong to two separate pairs, each with its own uppercase and lowercase form: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i"). (See the sketch after this list.)
2) I really doubt that the current upper limit of U+10_FFFF is going to need to be raised. Past growth in the Unicode standard has primarily been driven by the addition of more CJK characters; that isn't going to continue indefinitely.
3) Disallowing C0 characters like U+0009 (horizontal tab) is absurd, especially at the level of a text encoding.
4) BOMs are dumb. We learned that lesson in the early 2000s - even if they sound great as a way of identifying text encodings, they have a nasty way of sneaking into the middle of strings and causing havoc. Bringing them back is a terrible idea.
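To illustrate the Turkish point above in code: a minimal sketch (the function name is mine, not a real library API) of why lowercasing can't be a single global table, since the mapping for U+0049 depends on the locale.

fn to_lowercase_tr(c: char) -> char {
    // Hardcoding only the Turkish dotted/dotless I special cases;
    // everything else falls back to the default Unicode mapping.
    match c {
        'I' => 'ı', // U+0049 lowercases to U+0131 (dotless) in Turkish, not U+0069
        'İ' => 'i', // U+0130 lowercases to U+0069
        _ => c.to_lowercase().next().unwrap_or(c),
    }
}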
I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.
Compare to WTF-8, which solves a different problem (representing unpaired UTF-16 surrogates within an 8-bit encoding).
It's literally the exact opposite of this proposal, in that it identifies an actual, concrete problem and shows how to make it not a problem. This one is a list of weird grievances that aren't actually problems for anyone, like the max code point number.
Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.
unicode-assignable =
    %x9 / %xA / %xD /                ; useful controls
    %x20-7E /                        ; exclude C1 controls and DEL
    %xA0-D7FF /                      ; exclude surrogates
    %xE000-FDCF /                    ; exclude FDD0 nonchars
    %xFDF0-FFFD /                    ; exclude FFFE and FFFF nonchars
    %x10000-1FFFD / %x20000-2FFFD /  ; (repeat per plane)
    %x30000-3FFFD / %x40000-4FFFD /
    %x50000-5FFFD / %x60000-6FFFD /
    %x70000-7FFFD / %x80000-8FFFD /
    %x90000-9FFFD / %xA0000-AFFFD /
    %xB0000-BFFFD / %xC0000-CFFFD /
    %xD0000-DFFFD / %xE0000-EFFFD /
    %xF0000-FFFFD / %x100000-10FFFD
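As a rough illustration, here's one way that same subset could be checked in code (a sketch against Rust's char type, which already rules out surrogates; the function name is mine, not from any spec):

fn is_unicode_assignable(c: char) -> bool {
    let cp = c as u32;
    matches!(cp,
        0x09 | 0x0A | 0x0D   // useful controls
        | 0x20..=0x7E        // printable ASCII (no other C0 controls, no DEL)
        | 0xA0..=0xD7FF      // skips C1 controls; char already excludes surrogates
        | 0xE000..=0xFDCF    // up to the FDD0..FDEF noncharacters
        | 0xFDF0..=0xFFFD)   // skips FFFE and FFFF
        // Planes 1..16: everything except the per-plane xFFFE/xFFFF noncharacters.
        || (cp >= 0x1_0000 && (cp & 0xFFFF) < 0xFFFE)
}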
I don't understand the desire to make existing characters unrepresentable. For the sake of what? Shifting used characters earlier in the byte sequence?
I think the offset scheme should only be used to fix overlong encodings, not to patch over an ad hoc hole at the same time. It seems safer to make it possible to encode all codepoints, whether those codepoints should be used or not. Unicode already has holes in various ranges anyway.
I wish it were true, but it's not.
And even Swift is designed so that strings can be UTF-8 or UTF-16 internally, for cheap Obj-C interop reasons.
Discarding compatibility with two of the top ~5 most widely used languages kind of reflects how disconnected the author is from the technical realities of whether any fixed UTF-8 would be feasible outside of the most toy use cases.
It's a missed opportunity that this isn't already the case, but if you're going to replace UTF-8, we should absolutely mandate one of the normalization forms along the way.
If you skip all the modifiers, you end up with an explosion in code space. If you skip all the precomposed characters, you end up with an explosion in bytes.
There's no good solution here, so normalization makes sense. But then the committee says ".. and what about this kind of normalization" and then you end up.. here.
But normalized forms are about sequences of code points that are semantically equivalent. You can't make the non-normalized code point sequences unencodable in an encoding that only looks at one code point at a time. You wouldn't want to anchor the encoding to any particular version of Unicode either.
Normalized forms have to happen at another layer. That layer is often omitted for efficiency or because nobody stopped to consider it, but the code point encoding layer isn't the right place.
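A tiny illustration of why this has to live above the encoding layer (plain Rust, no crates): both spellings of "é" below are perfectly valid code point sequences, so a per-code-point encoder has no basis for rejecting or rewriting either one.

fn main() {
    let nfc = "\u{00E9}";  // precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    let nfd = "e\u{0301}"; // decomposed: U+0065 + U+0301 COMBINING ACUTE ACCENT
    assert_ne!(nfc, nfd);                       // different code points and bytes...
    assert_eq!((nfc.len(), nfd.len()), (2, 3)); // ...2 vs 3 bytes of UTF-8...
    // ...yet they are canonically equivalent, which only a normalization pass
    // over the whole sequence (not a per-code-point encoder) can know.
}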
Obviously UTF-8 has 256 code units (<00> to <FF>). 128 of them are always valid within a UTF-8 string (ASCII, <00> to <7F>), leaving 128 code units that could be invalid within a UTF-8 string (<80> to <FF>).
There also happen to be exactly 128 2-byte overlong representations (overlong representations of ASCII characters).
Basically, any byte in some input that can't be interpreted as valid UTF-8 can be replaced with a 2-byte overlong representation. This can be used as an extension of WTF-8 so that UTF-16 and UTF-8 errors can both be stored in the same stream. I called the encoding WTF-8b [2], though I'd be interested to know if someone else has come up with the same scheme.
This should technically be "fine" WRT Unicode text processing, since it involves transforming invalid Unicode into other invalid Unicode. This principle is already used by WTF-8.
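A sketch of how such an escape could look (my own illustration of the idea; the exact mapping WTF-8b uses may differ):

// Escape an un-decodable input byte (0x80..=0xFF) as the 2-byte overlong
// encoding of its low 7 bits. The result is never well-formed UTF-8, so it
// can't collide with real text.
fn escape_invalid_byte(b: u8) -> [u8; 2] {
    debug_assert!(b >= 0x80);
    let v = b & 0x7F;                    // 128 possible values
    [0xC0 | (v >> 6), 0x80 | (v & 0x3F)] // the overlong 2-byte form of v
}

// Recognize only the 128 pairs produced above and recover the original byte.
fn unescape_invalid_byte(lead: u8, cont: u8) -> Option<u8> {
    if (lead & 0xFE) == 0xC0 && (cont & 0xC0) == 0x80 {
        Some(0x80 | ((lead & 0x01) << 6) | (cont & 0x3F))
    } else {
        None
    }
}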
I used it to improve preservation of invalid Unicode (ie, random 8-bit data in UTF-8 text or random 16-bit data in JSON strings) in jq, though I suspect the PR [1] won't be accepted. I still find the changes very useful personally, so maybe I'll come up with a different approach some time.
[0] https://github.com/Maxdamantus/jq/blob/911d01aaa5bd33137fadf...
[1] https://github.com/jqlang/jq/pull/2314
[2] I think I used the name "WTF-8b" as an allusion to UTF-8b/surrogateescape/PEP-383 which also encodes ill-formed UTF-8, though UTF-8b is less efficient storage-wise and is not compatible with WTF-8.
In something other than C, I'd expect them to be distinguished as members of an enumeration or something, e.g.:
enum DecodeResult {
    Ok(char),
    ErrUtf8(u8),   // an un-decodable byte, 0x80..=0xFF
    ErrUtf16(u16), // an unpaired surrogate, 0xD800..=0xDFFF
}
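Hypothetical usage, just to show the point of keeping the error payload around instead of collapsing everything to U+FFFD:

fn show(r: DecodeResult) -> String {
    match r {
        DecodeResult::Ok(c) => c.to_string(),
        DecodeResult::ErrUtf8(b) => format!("<invalid byte 0x{b:02X}>"),
        DecodeResult::ErrUtf16(s) => format!("<lone surrogate U+{s:04X}>"),
    }
}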
UTF-16 is integral to the workings of Windows, Java, and JavaScript, so it's not going away anytime soon. To make things worse, those systems don't even handle surrogates correctly, to the point where we had to build WTF-8, an encoding for carrying the malformed UTF-16 produced by those early adopters over into UTF-8. Before we can start talking about characters beyond plane 16, we need to find an answer for how those existing systems should handle characters beyond U+10FFFF.
I can't think of a good way for them to do this, though:
1. Opting in to an alternate UTF-8 string type to migrate these systems off UTF-16 means loads of old software that just chokes on new characters. Do you remember how MySQL decided you had to opt into utf8mb4 encoding to use astral characters in strings? And how basically nobody bothered to do this up until emoji forced everyone's hand? Do you want to do that dance again, but for the entire Windows API?
2. We can't just "rip out UTF-16" without breaking compatibility. WCHAR strings in Windows are expected to consist of 16-bit units holding Unicode codepoints, and programs can index those directly. JavaScript strings are a bit better in that they could be UTF-8 internally, but they still have length and indexing semantics inherited from Unicode 1.0.
3. If we don't "rip out UTF-16", though, then we need some kind of representation of characters beyond plane 16. There is no space left in plane 0 (the BMP) for this; we already used a good chunk of it for surrogates. Furthermore, it's a practical requirement of Unicode that all encodings be self-synchronizing: deleting or inserting a byte shouldn't change the meaning of more than one or two characters.
The most practical way forward for >U+10FFFF "superastrals" would be to reserve space for super-surrogates in the currently unused plane 4-13 space. A plane for low surrogates and half a plane for high would give us 31 bits of coding, but they'd already be astral characters. This yields the rather comical result of requiring 8 bytes to represent a 4 byte codepoint, because of two layers of surrogacy.
If we hadn't already dedicated codepoints to the first layer of surrogates, we could have had an alternative with unlimited coding range like UTF-8. If I were allowed to redefine 0xD800-0xDFFF, I'd change them from low and high surrogates to initial and extension surrogates, as such:
- 2-word initial surrogate: 0b1101110 + 9 bits of initial codepoint index (U+10000 through U+7FFFF)
- 3-word initial surrogate: 0b11011110 + 8 bits of initial codepoint index (U+80000 through U+FFFFFFF)
- 4-word initial surrogate: 0b110111110 + 7 bits of initial codepoint index (U+10000000 through U+1FFFFFFFFF)
- Extension surrogate: 0b110110 + 10 bits of additional codepoint index
U+80000 to U+10FFFF now take 6 bytes to encode instead of 4, but in exchange we now can encode U+110000 through U+FFFFFFF in the same size. We can even trudge on to 37-bit codepoints, if we decided to invent a surrogacy scheme for UTF-32[0] and also allow FE/FF to signal very long UTF-8 sequences as suggested in the original article. Suffice it to say this is a comically overbuilt system.
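A sketch of that hypothetical initial/extension layout (my own illustration of the bit patterns described above, not any real standard):

// Encode one codepoint above U+FFFF into 16-bit words using the
// initial/extension surrogate scheme sketched above.
fn encode_super_utf16(cp: u64, out: &mut Vec<u16>) {
    assert!(cp > 0xFFFF && cp <= 0x1F_FFFF_FFFF);
    let ext = |bits: u64| 0xD800 | (bits & 0x3FF) as u16; // 0b110110 + 10 bits
    if cp <= 0x7_FFFF {
        out.push(0xDC00 | (cp >> 10) as u16); // 0b1101110 + 9 bits
        out.push(ext(cp));
    } else if cp <= 0xFFF_FFFF {
        out.push(0xDE00 | (cp >> 20) as u16); // 0b11011110 + 8 bits
        out.push(ext(cp >> 10));
        out.push(ext(cp));
    } else {
        out.push(0xDF00 | (cp >> 30) as u16); // 0b110111110 + 7 bits
        out.push(ext(cp >> 20));
        out.push(ext(cp >> 10));
        out.push(ext(cp));
    }
}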
Of course, the feasibility of this is also debatable. I just spent a good while explaining why we can't touch UTF-16 at all, right? Well, most of the stuff that is married to UTF-16 specifically ignores surrogates, treating them as a headache for the application developer. In practice, mispaired surrogates never break things; that's why we had to invent WTF-8 to clean up after that mess.
You may have noticed that initial surrogates in my scheme occupy the coding space for low surrogates. Existing surrogates are supposed to be sent in the order high, low. So an initial, extension pair is actually the opposite surrogate order from what existing code expects. Unfortunately this isn't quite self-synchronizing in the world we currently live in. Deleting an initial surrogate will change the meaning of all following 2-word pairs to high/low pairs, unless you have some out of band way to signal that some text is encoded with initial / extension surrogates instead of high / low pairs. So I wouldn't recommend sending anything like this on the wire, and UTF-16 parsers would need to forbid mixed surrogacy ordering.
But then again, nobody sends UTF-16 on the wire anymore, so I don't know how much of a problem this would be. And of course, there's the underlying problem that the demand for codepoints beyond U+10FFFF is very low. Hell, the article itself admits that at the current growth rate, Unicode has 600 years before it runs into this problem.
[0] Un(?)fortunately this would not be able to reuse the existing surrogate space for UTF-16, meaning we'd need to have a huge amount of the superastral planes reserved for even more comically large expansion.
jmclnx•7h ago
But the "magic number" thing to me is a waste of space. If this standard is accepted, if no magic number you have corrected UTF-8.
As for \r\n, not a big deal to me. I would like to see it forbidden, if only to force Microsoft to use \n like UN*X and Apple. I still have to deal with \r\n showing up in files every so often.
_kst_•6h ago
That's true only if "corrected UTF-8" is accepted and existing UTF-8 becomes obsolete. That can't happen. There's too much existing UTF-8 text that will never be translated to a newer standard.