- Number of UTF-8 code units (17 in this case)
- Number of UTF-16 code units (7 in this case)
- Number of UTF-32 code units or Unicode scalar values (5 in this case)
- Number of extended grapheme clusters (1 in this case)
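For concreteness, a minimal Python sketch that reproduces those four counts for the facepalm emoji from the article (the grapheme-cluster count is not in the standard library; the third-party `grapheme` package is one option):

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # facepalm + skin tone + ZWJ + male sign + VS-16
    print(len(s.encode("utf-8")))            # 17 UTF-8 code units (bytes)
    print(len(s.encode("utf-16-le")) // 2)   # 7 UTF-16 code units
    print(len(s))                            # 5 code points / Unicode scalar values
    # Extended grapheme clusters: no stdlib API; e.g. the third-party
    # `grapheme` package reports grapheme.length(s) == 1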
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we had all decided to report the number of bytes the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
I don't understand. It depends on the encoding, doesn't it?
Only if you are using a new enough version of unicode. If you were using an older version it is more than 1. As new unicode updates come out, the number of grapheme clusters a string has can change.
But that isn't the same across all languages, or even across all implementations of the same language.
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
My understanding of the current "always and only UTF-8/Unicode" zeitgeist is that it comes mostly from encoding issues, chief among them the complexity of detecting encodings.
I think that the current status quo is better than what came before, but I also think it could be improved.
The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. That's most of them, but it seems like the worst of both worlds.
Non-normalized Unicode is just as problematic as non-validated Unicode, IMO.
But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...
Show me.
This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array.
It almost works as-is in my testing. (By the way, there's a typo in the usage message.) Here is my test process:
#!/usr/bin/env python
import random, sys, time

def out(b):
    # ASCII 0..7 for the second digit of the color code in the escape sequence
    color = random.randint(48, 55)
    sys.stdout.buffer.write(bytes([27, 91, 51, color, 109, b]))
    sys.stdout.flush()

for i in range(32, 256):
    out(i)
    time.sleep(random.random()/5)

while True:
    out(random.randint(32, 255))
    time.sleep(0.1)
I suppressed random output of C0 control characters to avoid messing up my terminal, but I added a test that basic ANSI escape sequences can work through this. (My initial version of this didn't flush the output, which mistakenly led me to try a bunch of unnecessary things in the main script.)
After fixing the `print` calls, the only thing I was forced to change (although I would do the code differently overall) is the output step:
# sys.stdout.write(out.encode("UTF-8"))
sys.stdout.buffer.write(out.encode("UTF-8"))
sys.stdout.flush()
I've tried this out locally (in gnome-terminal) with no issue. (I also compared to the original; I have a local build of 2.7 and adjusted the shebang appropriately.)

There's a warning that `bufsize=1` no longer actually means a byte buffer of size 1 for reading (instead it's magically interpreted as a request for line buffering), but this didn't cause a failure when I tried it. (And setting the size to e.g. `2` didn't break things, either.)
I also tried having my test process read from standard input; the handling of ctrl-C and ctrl-D seems to be a bit different (and in general, setting up a Python process to read unbuffered bytes from stdin isn't the most fun thing), but I generally couldn't find any issues here, either. Which is to say, the problems there are in the test process, not in `ibmfilter`. The input is still forwarded to, and readable from, the test process via the `Popen` object. And any problems of this sort are definitely still fixable, as demonstrated by the fact that `curses` is still in the standard library.
Of course, keys in the `special` mapping need to be defined as bytes literals now. Although that could trivially be adapted if you insist.
As for the typo, yep. But then, I've left this script essentially untouched for a couple of decades since I was given it.
Here's a diff:
diff --git a/ibmfilter b/ibmfilter
index 245d32c..2633335 100755
--- a/ibmfilter
+++ b/ibmfilter
@@ -1,6 +1,5 @@
-#!/usr/bin/python2 -tt
-# vim:set fileencoding=utf-8
-
+#!/usr/bin/python3
+
 from subprocess import *
 import sys
 import os, select
@@ -10,8 +9,8 @@ special = {
 }
 if len(sys.argv) < 2:
-    print "usage: ibmfilter [command]"
-    print "Runs command in a subshell and translates its output from ibm473 codepage to UTF-8."
+    print("usage: ibmfilter [command]")
+    print("Runs command in a subshell and translates its output from ibm473 codepage to UTF-8.")
     sys.exit(0)
 handle = Popen(sys.argv[1:], stdout=PIPE, bufsize=1)
@@ -26,8 +25,10 @@ while buf != '':
         os.kill(handle.pid)
         os.system('reset')
         raise Exception("Timed out while waiting for stdout to be writeable...")
-    sys.stdout.write(out.encode("UTF-8"))
-
+    sys.stdout.buffer.write(out.encode("UTF-8"))
+    sys.stdout.flush()
+
     buf = handle.stdout.read(1)
 handle.wait()
I have already tested it and it works fine, as far as I can tell, on every version from at least 3.3 through 3.13 inclusive. There's really nothing version-specific here, except the warning I mentioned, which was introduced in 3.8. If you encounter a problem, some more sophisticated diagnostics would be needed, and honestly I'm not actually sure where to start with that. (Although I'm mildly impressed that you still have access to a 2.7 interpreter in /usr/bin without breaking anything else.)

If you want to add overrides, you must use bytes literals for the keys. That looks like:
b'\xff': 'X'
> (heck, pip even warns you not to try installing libs globally so everyone can use same set these days)

Some Python programs have mutually incompatible dependencies, and you can't really have two versions of the same dependency loaded in the same runtime. This has always been a problem; you're just looking at the current iteration of pip trying to cooperate with Linux distros to help you not break your system as a result.
"Using the same set" is not actually desirable for development.
And with that out of the way. This one seems to mostly work!
So python3 did not significantly change the handling of this sort of byte stream, and while the Mercurial folks might well have had their own woes, I have no idea what the issues were in all those prior attempts with this file.
... that said, it does do one odd thing (following is output on launching):
/usr/lib/python3.12/subprocess.py:1016: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stdout = io.open(c2pread, 'rb', bufsize)
And yet, I can't spot any issues in gameplay caused by this so far, so I'm inclined to let it pass? But it does make me wonder if I might hit issues later on...

At least for now, I'm going to tentatively say it seems fine. Hm. You know what, let me try some more obvious things that might fail if the buffer size is wrong.
So. Now I'm wondering: given how relatively minor this change is (aside from the odd error message and the typical python3 changes, just one slightly modified line and one inserted line), why did so many pythonistas have so much difficulty over the many years I asked about this? I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling due to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning? Well, whatever. Clearly it's (mostly) fine now. And my carefully tweaked nethack profile is safe if python2 is removed, without needing to make my own stream filter. Yay! Thanks!
... further updates: OK, there are a few issues.
1) the warning
2) there's an odd ghost flicker that jumps around the nethack level as if a cursor is appearing - does not happen in the python2 one.
3) on quitting, it no longer exits gracefully and I have to ctrl-c the script.
4) It is much slower to render. The python2 one draws a screen almost instantly for most uses (although still a bit slower than not filtered, at least on this computer, for things that change a lot, like video). This one ripples down - that might explain the ghost flickering in ② and might be related to the buffer warning. This becomes much more noticeable with BBSes, although it is usually fine in nethack. You can see the difference on a simpler testcase, without setting up a BBS account, by streaming a bit more data at once, say by running: ibmfilter curl ascii.live/nyan
So, clearly not perfect but.. eh. functional? Still far better than prior attempts, and at least it mostly works with nethack.
Yes, that would be exactly why. You can use e.g. `sed` to remove leading whitespace from each line (I used it to add the leading whitespace for posting).
> ... that said, it does do one odd thing (following is output on launching):
Yes, that's the warning I mentioned. The original code requests to use a buffer size of 1, which is no longer supported (it now means to use line buffering).
> It is much slower to render.
Avoiding line buffering (by requesting a buffer size of 2 or more) might fix that. Actually, it might be a good idea to use a significantly larger buffer, so that e.g. an entire ANSI colour code can be read all at once.
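A minimal sketch of that suggestion, assuming the `Popen` call from the diff above (the value 4096 is just an illustrative guess):

    from subprocess import Popen, PIPE
    import sys

    # Any bufsize >= 2 avoids the line-buffering reinterpretation; a larger
    # buffer lets whole escape sequences arrive in fewer underlying reads.
    handle = Popen(sys.argv[1:], stdout=PIPE, bufsize=4096)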
The other issues are, I'm pretty sure, because of other things that changed in how `subprocess` works. Fixing things at this level would indeed require quite a bit more hacking around with the low-level terminal APIs.
> I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning?
Most likely, other attempts either a) didn't understand what the original code was doing in precise enough detail, or b) didn't know how to send binary data to standard output properly (Python 3 defaults to opening standard output as a text stream).
All of that said: I think that nowadays you should just be able to get a build of NetHack that just outputs UTF-8 characters directly; failing that, you can use the `locale` command to tell your terminal to expect cp437 data.
The unfortunate thing is the "lag" is a bit annoying with some apps, so I'll probably still use the python2 one for now.
It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.
It would be pretty silly for them to explode all strings to 4-byte characters.
They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.
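This is easy to observe from CPython itself (PEP 393 behaviour; the exact byte counts are implementation details that vary by version and platform):

    import sys

    # CPython picks the narrowest storage that fits the widest code point.
    for s in ["abcd",             # ASCII            -> 1 byte per code point
              "abc\u00e9",        # Latin-1 range    -> 1 byte per code point
              "abc\u20ac",        # BMP (euro sign)  -> 2 bytes per code point
              "abc\U0001F926"]:   # outside the BMP  -> 4 bytes per code point
        print(len(s), sys.getsizeof(s))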
I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.
Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.
Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?
It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.
You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't submerged in the language from an early age.
Imagine learning programming using only your high school Spanish. Good luck.
This + translated materials + locally written books is how STEM fields work in East Asia; the odds of success shouldn't be low. There just needs to be enough population using your language.
And frequently, there is no other name. There are a lot of diseases, and no language has names for all of them.
Identifiers in code are not a limited vocabulary, and understanding the structure of your code is important, especially so when you are in the early stages of learning.
Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.
Maybe I'm tired, but I've read this multiple times and can't quite figure out your desired position.
I *think* you are in favor of non -ASCII identifiers?
Like I said, I must be tired.
No, it is actually for security reasons. Once you allow non-ASCII identifiers, identifiers become non-identifiable. Only Zig recognized that. Nim allows insecure identifiers. https://github.com/rurban/libu8ident/blob/master/doc/c11.md#...
The motte: non-ASCII identifiers should be allowed
The bailey: disallowing non-ASCII identifiers is American imperialism at its worst
UNICODE is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (i.e. for the first 128 UNICODE code points, a UTF-8 encoded string is byte-by-byte compatible with the same string encoded in ASCII).
Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") - or any of the legacy 'national' multibyte encodings (Shift-JIS etc...) because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, but some flavour of Extended ASCII decoded with the wrong codepage).
In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything.
I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it.
Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed.
I was with you until this sentence. UTF-8 everywhere is great exactly because it is ASCII-compatible (i.e. all ASCII strings are automatically also valid UTF-8 strings, so UTF-8 is a natural upgrade path from ASCII) - both are just encodings for the same UNICODE codepoints, ASCII just cannot go beyond the first 128 codepoints, but that's where UTF-8 comes in, and in a way that's backward compatible with ASCII - which is the one ingenious feature of the UTF-8 encoding.
And bytes can conveniently fit both ASCII and UTF-8.
If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much.
But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences graphically show the appropriate character "�", otherwise let it pass through unmodified. Just don't crash, show useless error messages or in the worst case try to "fix" it by mangling the data even more.
This string cannot be encoded as ASCII in the first place.
> But if you allow full 8-bit bytes, please don't restrict them to UTF-8
UTF-8 has no 8-bit restrictions... You can encode any 21-bit UNICODE codepoint with UTF-8.
It sounds like you're confusing ASCII, Extended ASCII and UTF-8:
- ASCII: 7-bits per "character" (e.g. not able to encode international characters like äöü) but maps to the lower 7-bits of the 21-bits of UNICODE codepoints (e.g. all ASCII character codes are also valid UNICODE code points)
- Extended ASCII: 8-bits per "character" but the interpretation of the upper 128 values depends on a country-specific codepage (i.e. the interpretation of a byte value in the range between 128 and 255 differs between countries, and this is what causes all the mess that's usually associated with "ASCII". But ASCII did nothing wrong - the problem is Extended ASCII - this allows one to 'encode' äöü with the German codepage but then shows different characters when displayed with a non-German codepage)
- UTF-8: a variable-width encoding for the full range of UNICODE codepoints, uses 1..4 bytes to encode one 21-bit UNICODE codepoint, and the 1-byte encodings are identical with 7-bit ASCII (e.g. when the MSB of a byte in an UTF-8 string is not set, you can be sure that it is a character/codepoint in the ASCII range).
Out of those three, only Extended ASCII with codepages are 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine since any valid ASCII encoded string is indistinguishable from that same string encoded as UTF-8, e.g. ASCII has been 'retconned' into UTF-8.
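To make the codepage failure mode concrete, here is a small Python illustration (the specific codepage cp437 is just an example):

    s = "äöü"
    utf8 = s.encode("utf-8")            # b'\xc3\xa4\xc3\xb6\xc3\xbc'
    print(utf8.decode("cp437"))         # mojibake: the same bytes read through a codepage
    print(utf8.decode("utf-8"))         # 'äöü' again
    print("abc".encode("ascii") == "abc".encode("utf-8"))   # True: ASCII survives unchanged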
The problem they're describing happens because file names (in Linux and Windows) are not text: in Linux (so Android) they're arbitrary sequences of bytes, and in Windows they're arbitrary sequences of UTF-16 code points not necessarily forming valid scalar values (for example, surrogates can be present alone).
And yet, a lot of programs ignore that and insist on storing file names as Unicode strings, which mostly works (because users almost always name files by inputting text) until somehow a file gets written as a sequence of bytes that doesn't map to a valid string (i.e., it's not UTF-8 or UTF-16, depending on the system).
So what's probably happening in GP's case is that they managed somehow to get a file with a non-UTF-8-byte-sequence name in Android, and subsequently every App that tries to deal with that file uses an API that converts the file name to a string containing U+FFFD ("replacement character") when the invalid UTF-8 byte is found. So when GP tries to delete the file, the App will try to delete the file name with the U+FFFD character, which will fail because that file doesn't exist.
GP is saying that showing the U+FFFD character is fine, but the App should understand that the actual file name is not UTF-8 and behave accordingly (i.e. use the original sequence-of-bytes filename when trying to delete it).
Note that this is harder than it should be. For example, with the old Java API (from java.io[1]) that's impossible: if you get a `File` object from listing a directory and ask if it exists, you'll get `false` for GP's file, because the `File` object internally stores the file name as a Java string. To get the correct result, you have to use the new API (from java.nio.file[2]) using `Path` objects.
[1] https://developer.android.com/reference/java/io/File
[2] https://developer.android.com/reference/java/nio/file/Path
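The same round-trip issue can be sketched in Python on a POSIX system (hypothetical file name; Python decodes OS file names with the 'surrogateescape' error handler rather than U+FFFD, so the original bytes are recoverable):

    import os, tempfile

    d = tempfile.mkdtemp()
    # Create a file whose name is Latin-1-encoded, i.e. not valid UTF-8.
    open(os.path.join(os.fsencode(d), b"caf\xe9.txt"), "wb").close()
    print(os.listdir(os.fsencode(d)))   # [b'caf\xe9.txt']   - the exact on-disk bytes
    print(os.listdir(d))                # ['caf\udce9.txt']  - a lone surrogate, not U+FFFD
    name = os.listdir(d)[0]
    os.remove(os.path.join(d, name))    # works: the str round-trips to the original bytes
    os.rmdir(d)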
Sure, it's backward compatible, as in ASCII-handling code works on systems with UTF-8 locales, but how important is that?
It's only Windows which is stuck in the past here, and Microsoft had 3 decades to fix that problem and migrate away from codepages to locale-agnostic UTF-8 (UTF-8 was invented in 1992).
None of the signals were intuitive because they weren’t the typical English abbreviations!
Restricting the program part to ASCII is fine for me, but as a fellow German it's also important to recognize that we don't lose much by not having ä cömplete sät of letters. Everyone can write comprehensible German using ASCII characters only. So I would listen to what people from languages that really don't fit into ASCII have to say.
More relevantly though, good things can come from people who also did bad things; this isn't to justify doing bad things in hopes something good also happens, but it doesn't mean we need to ideologically purge good things based on their creators.
ASCII wasn't "imperialism," it was pragmatism. Yes, it privileged English -- but that's because the engineers designing it _spoke_ English and the US was funding + exporting most of the early computer and networking gear. The US Military essentially gave the world TCP/IP (via DARPA) for free!
Maybe "cultural dominance", but "imperialism at its worst" is a ridiculous take.
Now list anything as important from your list of downsides that's just as unfixable
That's a tradeoff you should carefully consider because there are also downsides to disallowing non-ASCII characters. The downsides of allowing non-ASCII mostly stem from assigning semantic significance to upper/lowercase (which is itself a tradeoff you should consider when designing a language). The other issue I can think of is homographs but it seems to be more of a theoretical concern than a problem you'd run into in practice.
When I first learned programming I used my native language (Finnish, which uses 3 non-ASCII letters: åäö) not only for strings and comments but also identifiers. Back then UTF-8 was not yet universally adopted (ISO 8859-1 character set was still relatively common) so I occasionally encountered issues that I had no means to understand at the time. As programming is being taught to younger and younger audiences it's not reasonable to expect kids from (insert your favorite non-English speaking country) to know enough English to use it for naming. Naming and, to an extent, thinking in English requires a vocabulary orders of magnitude larger than knowing the keywords.
By restricting source code to ASCII only you also lose the ability to use domain-specific notation like mathematical symbols/operators and Greek letters. For example in Julia you may use some mathematical operators (eg. ÷ for Euclidean division, ⊻ for exclusive or, ∈/∉/∋ for checking set membership) and I find it really makes code more pleasant to read.
Not saying the trade-off isn't worth it, but I do feel like there is a tendency to overuse unicode somewhat in Julia.
Just never ever use Extended ASCII (8-bits with codepages).
In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:
String.len() == number of bytes
String.bytes().count() == number of bytes
String.chars().count() == number of unicode scalar values
String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
String.lines().count() == number of lines
Really my only complaint is I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators, e.g. String.graphemes().count().
That's a real nice API. (Similarly, Python has @ for matmul but there is no implementation of matmul in the stdlib. NumPy has a matmul implementation so that the `@` operator works.)

ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.
It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?
String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(UTF-8).length

Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.
Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US. Where a lot of people with 's in their names live... Or double-barrelled names.
It's a much simpler problem and it still tripped up a lot of people.
And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold...
Even plain English text can't be represented with plain ASCII (although ISO-8859-1 goes a long way).
There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial).
The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
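A short Python illustration of that distinction (NFC is chosen arbitrarily here; NFD works equally well for canonical equivalence):

    import unicodedata

    a = "\u00e9"      # é as a single precomposed code point
    b = "e\u0301"     # e followed by a combining acute accent
    print(a == b)     # False: code-point-wise (effectively bitwise) comparison
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))   # True: canonically equivalent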
I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly
UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.
(Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.)
P.S. Everything about the response to this comment is wrong, especially the absurd baseless claim that I misunderstood the claim that I quoted and corrected (that's the only claim I responded to).
My comment explains that you have misunderstood what the claim is. "Byte code format" was nonsensical (Unicode is not interpreted by a VM), but the point that comment was trying to make (as I understood it) is that not all subsequences of a valid sequence of (assigned) code points are valid.
> Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.
My definition does not contradict that. A code point is an integer in the Unicode code space which may correspond to a character. When it does, "character" trivially means the thing that the code point corresponds to, i.e., represents, as I said.
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.
- letter
- word
- 5 :P
Never thought of it, but maybe there are rules that allow visually presenting the code point for ß as ss? At least (from experience as a user) there seems to be a single "ss" codepoint.
From a user experience perspective though it might be beneficial to pretend that "ß" == "ss" holds when parsing user input.
I never said it was ambiguous, I said it depends on the unicode version and the font you are using. How is that wrong? (Seems like the capital of ß is still SS in the latest unicode but since ẞ is the preferred capital version now this should change in the future)
I don't know how or if systems deal with this, but ß should be printed as ss if ß is unavailable in the font. It's possible this is completely up to the user.
[1] https://unicode.org/faq/casemap_charprop.html [2] https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk...
Where does the source corroborate that claim? Can you give us a hint where to find the source?
While in older versions [1] it was the other way around:
> E3: Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch die Verwendung des Großbuchstabens ẞ möglich. Beispiel: Straße – STRASSE – STRAẞE. (Translation: When writing in capital letters, one writes SS. The use of the capital letter ẞ is also possible. Example: Straße – STRASSE – STRAẞE.)
[1] https://www.rechtschreibrat.com/DOX/rfdr_Regeln_2016_redigie...
It's not. Uppercase of ß has always been SS.
Before we had a separate codepoint in Unicode this caused problems with round-tripping between upper and lower case. So Unicode rightfully introduced a separate codepoint specifically for that use case in 2008.
This inspired designers to design a glyph for that codepoint looking similar to ß. Nothing wrong with that.
Some liked the idea and it got some foothold, so in 2017, the Council for German Orthography allowed it as an acceptable variant.
Maybe it will win, maybe not, but for now in standard German the uppercase of ß is still SS and Unicode rightfully reflects that.
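For what it's worth, Python's case mappings, which follow the Unicode character data, reflect exactly this state of affairs:

    print("ß".upper())                       # 'SS'      - the default uppercase mapping
    print("ẞ".lower())                       # 'ß'       - the 2008 capital maps back down
    print("STRASSE".lower())                 # 'strasse' - the mapping is not reversible
    print("straße".casefold() == "strasse")  # True      - casefolding treats ß as ss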
Thanks, that is interesting!
> In the case of Wordle, you know the exact set of letters you’re going to be using
This holds for the generator side too. In fact, you have a fixed word list, and the fixed alphabet tells you what a "letter" is, and thus how to compute length. Because this concerns natural language, this will coincide with grapheme clusters, and with English Wordle, that will in turn correspond to byte length because it won't give you words with é (I think). In different languages the grapheme clusters might be larger than 1 byte (e.g. [1], where they're codepoints).
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.
Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.
Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.
This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
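To illustrate with Python (the string here is a decomposed "é"; both kinds of mid-sequence insertion are wrong, but only the byte-level one fails loudly):

    s = "e\u0301"                 # 'é' as e + combining acute accent
    b = s.encode("utf-8")         # b'e\xcc\x81'

    broken = b[:2] + b"a" + b[2:]     # byte offset 2 is inside the accent's encoding
    try:
        broken.decode("utf-8")
    except UnicodeDecodeError as e:
        print("caught:", e)           # fails loudly

    weird = s[:1] + "a" + s[1:]       # code point offset 1 splits the same cluster
    print(weird)                      # 'ea\u0301' - valid, silently renders as 'eá'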
> This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.
Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.
I vehemently dissent from this view.
Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).
Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.
Ironic.
> The advice wasn’t to ignore learning about code points
I didn't say "learning about."
Look man. People operate at different levels of abstraction, depending on what they're doing.
If you're doing front-end web dev, sure, don't worry about it. If you're hacking on a text editor in C, then you probably ought to be able to take a string of UTF-8 bytes, decode them into code points, and apply the grapheme clustering algorithm to them, taking into account your heuristics about what the terminal supports. And then probably either printing them to the screen (if it seems like they're supported) or printing out a representation of the code points. So yeah, you kind of have to know.
So don't sit there and presume to tell others what they should or should not reason about, based solely on what you assume their use case is.
No, it's telling people that they don't understand how data works; otherwise they'd be using a different unit of measurement.
Nobody is saying that, the point is that if you're parsing Unicode by counting codepoints you're doing it wrong. The way you actually parse Unicode text (in 99% of cases) is by iterating through the codepoints, and then the actual count is fairly irrelevant, it's just a stream.
Other uses of codepoint length are also questionable: for measurement it's useless, for bounds checking (random access) it's inefficient. It may be useful in some edge cases, but TFA's point is that a general purpose language's default string type shouldn't optimize for edge cases.
size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), at least if you're dealing with anything beyond ASCII (which you really should be, since East Asian users alone number in the billions).
Same goes to the "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
And before that, the only thing the relative rarity bought you was that bugs in code working on UTF-8 bytes got fixed, while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
The metrics you care about are likely number of letters from a human perspective (1) or the number of bytes of storage (depends), possibly both.
[1]: https://tomsmeding.com/unicode#U+65%20U+308 [2]: https://tomsmeding.com/unicode#U+EB
In an environment that supports advanced Unicode features, what exactly do you do with the string length?
I want to make sure that the password is between a given minimum and maximum number of characters. Same with phone numbers, email addresses, etc.
This seems to have always been known as the length of the string.
This thread sounds like a bunch of scientists trying to make a simple concept a lot harder to understand.
> This seems to have always been known as the length of the string.
Sure. And by this definition, the string discussed in TFA (that consists of a facepalm emoji with a skin tone set) objectively has 5 characters in it, and therefore a length of 5. And it has always had 5 characters in it, since it was first possible to create such a string.
Similarly, "é" has one character in it, but "é" has two despite appearing visually identical. Furthermore, those two strings will not compare equal in any sane programming language without explicit normalization (unless HN's software has normalized them already). If you allow passwords or email addresses to contain things like this, then you have to reckon with that brute fact.
None of this is new. These things have fundamentally been true since the introduction of Unicode in 1991.
do you mean "byte"? or "rune"?
If you do allow Unicode characters in whatever it is you're validating, then your approach is almost certainly wrong for some valid input.
For exact lengths, you often have a restricted character set (like for phone numbers) and can validate both characters and length with a regex. Or the length in bytes works for 0–9.
Unless you're involved in text layout, you actually usually don't wind up needing the exact length in characters of arbitrary UTF-8 text.
When I'm comparing human-readable strings I want the length. In all other cases I want sizeof(string), and it's... quite a variable thing.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
To predict the pixel width of a given text, right?
One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.
I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.
* I'm talking about the DOM route, not <canvas> obviously. VS Code is powered by Monaco, which is DOM-based, not canvas-based. You can "Developer: Toggle Developer Tools" to see the DOM structure under the hood.
** I should further qualify my statement as browsers are fundamentally incapable of this if you use native text node rendering. I have built a perfectly monospace mixed CJK and Latin interface myself by wrapping each full width character in a separate span. Not exactly a performance-oriented solution. Also IIRC Safari doesn’t handle lengths in fractional pixels very well.
Seemed awkward, but I eventually realized I rarely cared about the number of characters. Even when dealing with substrings, I really only cared about a means to describe “stuff” before/after, not literal indices.
Counting Unicode characters is actually a disservice.
TXR Lisp:
1> (len "🤦🏼‍♂️")
5
2> (coded-length "🤦🏼‍♂️")
17
(Trust me when I say that the emoji was there when I edited the comment.) The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
UTF-8 is so complicated because it wants to be backwards compatible with ASCII.
But thank you for the link, it's turning out to be a very enjoyable read! There already seems to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.
Python, of all languages, probably has the best property-based testing library out there with "hypothesis". I sometimes even use it to drive tests for my Haskell and OCaml and Rust code. The author of Hypothesis wrote a few nice articles about why his approach is better (and I agree); however, I can't find them at the moment...
Even ascii used to use "overstriking" where the backspace character was treated as a joiner character to put accents above letters.
- requires less memory for most strings, particularly ones that are largely limited to ASCII, like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8, while UTF-16 might be either little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: if you jump to a random memory position you can find the next and previous UTF-8 characters (see the sketch below). This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
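A minimal sketch of that self-synchronization property in Python (the helper name is made up for illustration):

    def prev_boundary(buf: bytes, i: int) -> int:
        """Step back from byte index i to the start of the UTF-8 sequence it falls inside."""
        while i > 0 and (buf[i] & 0xC0) == 0x80:   # 0b10xxxxxx marks a continuation byte
            i -= 1
        return i

    data = "naïve".encode("utf-8")   # 'ï' occupies bytes 2..3 (0xC3 0xAF)
    print(prev_boundary(data, 3))    # 2: index 3 is a continuation byte, so back up to its lead byte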
So "no combinations" was never going to happen.
Especially when you start getting into non latin-based languages.
Unicode definitely has its faults, but on the whole it‘s great. I‘ll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.
E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.
So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
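If you just need the segmentation rather than implementing it yourself, one option in Python is the third-party `regex` module (not the stdlib `re`), whose \X pattern matches extended grapheme clusters using its bundled Unicode data:

    import regex  # pip install regex

    print(regex.findall(r"\X", "e\u0301x"))   # ['é', 'x'] - the combining accent stays attached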
bool utf_append_plaintext(utf* result, const char* text) {
#define msk(byte, mask, value) ((byte & mask) == value)
#define cnt(byte) msk(byte, 0xc0, 0x80)
#define shf(byte, mask, amount) ((byte & mask) << amount)
    utf_clear(result);
    if (text == NULL)
        return false;
    size_t siz = strlen(text);
    uint8_t* nxt = (uint8_t*)text;
    uint8_t* end = nxt + siz;
    if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
        nxt += 3;
    while (nxt < end) {
        bool aok = false;
        uint32_t cod = 0;
        uint8_t fir = nxt[0];
        if (msk(fir, 0x80, 0)) {
            cod = fir;
            nxt += 1;
            aok = true;
        } else if ((nxt + 1) < end) {
            uint8_t sec = nxt[1];
            if (msk(fir, 0xe0, 0xc0)) {
                if (cnt(sec)) {
                    cod |= shf(fir, 0x1f, 6);
                    cod |= shf(sec, 0x3f, 0);
                    nxt += 2;
                    aok = true;
                }
            } else if ((nxt + 2) < end) {
                uint8_t thi = nxt[2];
                if (msk(fir, 0xf0, 0xe0)) {
                    if (cnt(sec) && cnt(thi)) {
                        cod |= shf(fir, 0x0f, 12);
                        cod |= shf(sec, 0x3f, 6);
                        cod |= shf(thi, 0x3f, 0);
                        nxt += 3;
                        aok = true;
                    }
                } else if ((nxt + 3) < end) {
                    uint8_t fou = nxt[3];
                    if (msk(fir, 0xf8, 0xf0)) {
                        if (cnt(sec) && cnt(thi) && cnt(fou)) {
                            cod |= shf(fir, 0x07, 18);
                            cod |= shf(sec, 0x3f, 12);
                            cod |= shf(thi, 0x3f, 6);
                            cod |= shf(fou, 0x3f, 0);
                            nxt += 4;
                            aok = true;
                        }
                    }
                }
            }
        }
        if (aok)
            utf_push(result, cod);
        else
            return false;
    }
    return true;
#undef cnt
#undef msk
#undef shf
}
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...
It even includes an optimized fast path for ASCII, and it works at compile-time as well.
Why are the arguments not three-letter though? I would feel terrible if that was my code.
e.g., https://github.com/mayo-dayo/app/blob/0.4/src/middleware.ts
Just set your editor's line-height.
static UnicodeCodepoint utf8_decode(u8 const bytes[static 4], u8 *out_num_consumed) {
    u8 const flipped = ~bytes[0];
    if (flipped == 0) {
        // Because __builtin_clz is UB for value 0.
        // When this happens, the UTF-8 is malformed.
        *out_num_consumed = 1;
        return 0;
    }
    u8 const num_ones = __builtin_clz(flipped) & 0x07;
    u8 const num_bytes_total = num_ones > 1 ? num_ones : 1;
    u8 const main_byte_shift = num_ones + 1;
    UnicodeCodepoint value = bytes[0] & (0xFF >> main_byte_shift);
    for (u8 i = 1; i < num_bytes_total; ++i) {
        if (bytes[i] >> 6 != 2) {
            // Not a valid continuation byte.
            *out_num_consumed = i;
            return 0;
        }
        value = (value << 6) | (bytes[i] & 0x3F);
    }
    *out_num_consumed = num_bytes_total;
    return value;
}

> [...(new Intl.Segmenter()).segment(THAT_FACEPALM_EMOJI)].length
1
[^1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.
You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."
You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.
But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.
> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
> The problem will be the same if you have to reconstruct the grapheme clusters eventually.
In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.
> You don't want that if you e.g. have an index for fulltext search.
Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.
No it doesn't. It says it's "rather useless" that len(str) returns the number of code points, because there's rarely a reason to store the count of code points as the string length. By contrast, storing the number of native code units is useful for storage allocation and concatenation, which are common operations.
Code point indexing is still very useful, depending on context. For example, a majority of Korean speakers (~50 million Internet users) prefer deletion by Jaso unit. Korean EGCs are whole syllables, and making someone retype a whole syllable to change one character is bad UX.
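A quick Python illustration of why the grapheme cluster can be the wrong granularity here (Hangul syllables decompose canonically into their jamo):

    import unicodedata

    s = "\ud55c"                             # '한': one grapheme cluster, one code point
    jamo = unicodedata.normalize("NFD", s)   # decompose into the individual jamo
    print(list(jamo))                        # ['ᄒ', 'ᅡ', 'ᆫ']
    print(len(s), len(jamo))                 # 1 3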
• https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)
• https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)
• https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)
I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
$ raku
Welcome to Rakudo™ v2025.06.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2025.06.
[0] > " ".chars
1
[1] > " ".codes
5
[2] > " ".encode('UTF-8').bytes
17
[3] > " ".NFD.map(*.chr.uniname)
(FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...
But most programmers think in arrays of grapheme clusters, whether they know it or not.
Which, to humor the parent, is also true of raw bytes strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
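A minimal demonstration of that point (a lone surrogate is a perfectly storable str value but not valid Unicode, so UTF-8 encoding fails unless you opt into an escape hatch):

    s = "\ud83e"                    # half of a surrogate pair
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print("caught:", e)         # surrogates not allowed
    print(s.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\xbe'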
> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
If I write,
def foo(s: str) -> …:
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take def foo(s: UnicodeWithBullshit) -> …:

It's a common mistake. A lot of code was written using str despite users needing it to operate on UnicodeWithBullshit. PEP 383 was a necessary escape hatch to fix countless broken programs.
No, nothing about the "string" type in python implies unicode. It's, for all intents and purposes, its own encoding, and should be treated as such. Not all encodings it can convert to are representable as unicode, and vice versa, so it makes no sense to think of it as unicode.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
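For example, a tiny sketch of the indexing behaviour described above:

    b = "ä".encode("utf-8")      # b'\xc3\xa4'
    print(b[0])                  # 195 - indexing bytes yields an int
    print(b[:1])                 # b'\xc3' - slicing yields bytes
    print(b.decode("utf-8"))     # 'ä' - text only comes back via an explicit decode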
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
If you want to see a more interesting case than emoji, check out Thai language. In Thai, vowels could appear before, after, above, below, or on many sides of the associated consonants.
It’s not wrong that "🤦🏼‍♂️".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)
String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)
String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte, is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
The unit is perfectly meaningful.
It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)
Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.
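To illustrate the interop point with a sketch: an unassigned code point (U+40000 is merely an example that happens to be unassigned as of current Unicode versions) can be stored, counted, and encoded without complaint:
import unicodedata

c = chr(0x40000)                                  # storable even though unassigned
print(len(c))                                     # 1
print(unicodedata.name(c, "<no assigned name>"))  # <no assigned name>
print(c.encode("utf-8"))                          # b'\xf1\x80\x80\x80'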
I don't understand what you mean by "USV count".
> but what is a character?
It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.
> …but "5" or "7"? Where do those even come from?
From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.
> Again: "character in the implementation" is a meaningless concept.
"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
Python does not use UTF-32, even notionally. Yes, I know it uses a compact representation in memory when the value is ASCII, etc. That's not what I'm talking about here. |str| != |all UTF32 strings|; `str` and "UTF-32" are different things, as there are values in the former that are absent in the latter, and again, this is why encoding to utf8 or any utf encoding is fallible in Python.
Code point count is not a meaningful metric, though I suppose strictly speaking, yes, len() is code points.
> I don't understand what you mean by "USV count".
The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.) It's the basic building block of Unicode. It's only marginally useful, and there's a host of other more meaningful metrics, like memory size, terminal width, graphemes, etc. But it's more meaningful than code points, and if you want to do anything at any higher level of representation, USVs are going to be what you want to build off. Anything else is going to be more fraught with error, needlessly.
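There's no built-in for it, but as a sketch of what "USV count" means in Python terms (code points minus surrogates):
def usv_count(s: str) -> int:
    # Unicode scalar values = code points excluding surrogates (U+D800..U+DFFF)
    return sum(1 for ch in s if not 0xD800 <= ord(ch) <= 0xDFFF)

print(usv_count("\U0001F926\U0001F3FC\u200D\u2642\uFE0F"))  # 5
print(len("a\ud800b"), usv_count("a\ud800b"))               # 3 2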
> It's what the Unicode standard says a character is.
The Unicode definition of "character" is not a technical definition, it's just there to help humans. Again, if I fed that definition to a human, and asked the same question above, <facepalm…> is 1 "character", according to that definition in Unicode as evaluated by a reasonable person. That's not the definition Python uses, since it returns 5. No reasonable person is looking at the linked definition, and then at the example string, and answering "5".
"How many smallest components of written language that has semantic value does <facepalm emoji …> have?" Nobody is answering "5".
(And if you're going to quibble with my use of definition (1.), the same applies to (2.). (3.) doesn't apply here as Python strings are not Unicode strings (again, |str| != |all Unicode strings|), (4.) is specific to Chinese.)
> "Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
A lot of people writing bad code does not make bad code good. Ambiguous technical documentation is likewise not made good by being ambiguous. Any use of "character" in technical writing would be made more clear by replacing it with one of the actual technical terms defined by Unicode, whether that's "UTF-16 code point", "USV", "byte", etc. "Character" leaves far too much up to the imagination of the reader.
No, codepoints are, hence their name. Scalars are a subset of all codepoints. https://stackoverflow.com/questions/48465265/what-is-the-dif...
> whether that's "UTF-16 code point"
That's not a thing; you're thinking of UTF-16 code units rather, I believe.
Yes, yes, the `str` type may contain data that doesn't represent a valid string. I've already explained elsewhere ITT that this is a feature.
And sure, pedantically it should be "UCS-4" rather than UTF-32 in my post, since a str object can be created which contains surrogates. But Python does not use surrogate pairs in representing text. It only stores surrogates, which it considers invalid at encoding time.
Whenever a `str` represents a valid string without surrogates, it will reliably encode. And when bytes are decoded, surrogates are not produced except where explicitly requested for error handling.
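A quick demonstration of both halves of that (the byte string is just an arbitrary bit of invalid UTF-8):
s = "abc\ud800"        # a lone surrogate has to be constructed deliberately
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print(e)           # rejected at encoding time: "surrogates not allowed"

raw = b"caf\xe9"       # Latin-1 bytes, not valid UTF-8
print(ascii(raw.decode("utf-8", errors="surrogateescape")))  # 'caf\udce9' -- only when asked for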
> The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.)
Ah.
Good news: since Python doesn't use surrogate pairs to represent valid text, these are the same whenever the `str` contents represent a valid text string in Python. And the cases where they don't, are rare and more or less must be deliberately crafted. You don't even get them from malicious user input, if you process input in obvious ways.
> The Unicode definition of "character" is not a technical definition, it's just there to help humans.
You're missing the point. The facepalm emoji has 5 characters in it. The Unicode Consortium says so. And they are, indisputably, the ones who get to decide what a "character" is in the context of Unicode.
I linked to the glossary on unicode.org. I don't understand how it could get any more official than that.
Or do you know another word for "the thing that an assigned Unicode code point has been assigned to"? cf. also the definition of https://www.unicode.org/glossary/#encoded_character , and note that definition 2 for "character" is "synonym of abstract character".
I just relied on this fact yesterday, so it's kind of a funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was what Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was to iterate through all the codepoints in the file, tally it all up by Unicode block, and print the results. Something this design was perfectly suited for.
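Not the actual script, obviously, but a rough sketch of that kind of per-block tally; the block table below is a tiny made-up sample, and a real tool would load the full Blocks.txt from the Unicode Character Database:
import sys
from bisect import bisect_right

# Incomplete, illustrative block table: (first code point, block name)
BLOCKS = [
    (0x0000, "Basic Latin"),
    (0x0080, "Latin-1 Supplement"),
    (0x0370, "Greek and Coptic"),
    (0x0400, "Cyrillic"),
    (0x2000, "General Punctuation"),
    (0x1F300, "Miscellaneous Symbols and Pictographs"),
    (0x1F900, "Supplemental Symbols and Pictographs"),
]
STARTS = [start for start, _ in BLOCKS]

def block_of(cp: int) -> str:
    # Nearest listed block at or below cp (inexact in the gaps, since the table is partial)
    return BLOCKS[bisect_right(STARTS, cp) - 1][1]

def tally(path: str) -> dict:
    counts = {}
    with open(path, encoding="utf-8") as f:
        for ch in f.read():          # iterate code point by code point
            name = block_of(ord(ch))
            counts[name] = counts.get(name, 0) + 1
    return counts

if __name__ == "__main__":
    for name, n in sorted(tally(sys.argv[1]).items(), key=lambda kv: -kv[1]):
        print(f"{n:8d}  {name}")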
Now of course:
- it coming in handy once for my specific random workload doesn't mean it's good design
- my specific workload may not be rational (am a dingus sometimes)
- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome
- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.
But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting only scalars is what would be weird in my view; you'd potentially be "randomly" skipping over parts of the data.
Also good against data fingerprinting, homoglyph attacks in links (e.g. in comments), pranks (greek question mark vs. semicolon), or if it's a strictly international codebase, checking for anything outside ASCII. So when you don't really trust a codebase and want to establish a baseline, basically.
But I also included other features, like checking line ending consistency, line indentation consistency, line lengths, POSIX compliance, and encoding validity. Line lengths were of particular interest to me, having seen some malicious PRs recently to FOSS projects where the attacker would just move the payload out of sight to the side, expecting most people to have word wrap off and just not even notice (pretty funny tbf).
" ".codePoints().count()
==> 5
" ".chars().count()
==> 7
" ".getBytes(UTF_8).length
==> 17
(HN doesn't render the emoji in comments, it seems, hence the \u escapes above.)
At first there was an empty space between the double quotes. This made me click and read the article because it was surprising that the length of a space would be 7.
Then the actual emoji appeared and the title finally made sense.
Now I see escaped \u{…} characters spelled out and it’s just ridiculous.
Can’t wait to come back tomorrow to see what it will be then.
> So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
Thank you!
https://stackoverflow.com/questions/2241348/what-are-unicode...
Still have more reading to do and a lot to learn but this was super informative, so thank you internet stranger.
1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.
2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.
3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
Anything more is and should be beyond the purview of the simple built-in `len`:
4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a “character”.” Cf. https://www.unicode.org/reports/tr29
Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?
Anyway, the great module https://pypi.org/project/regex/ supports “Matching a single grapheme \X”. So:
len(regex.findall(r"\X", "\U0001F926\U0001F3FC\u200D\u2642\uFE0F")) == 1
5. The space a sequence of code points will occupy on the screen: certainly useful but at least dependent on the typeface that will be used for rendering and hence certainly beyond the purview of a simple function.
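If you do want a rough answer to that last one, the third-party wcwidth package (https://pypi.org/project/wcwidth/) estimates terminal cell widths; actual rendering still depends on the terminal and font. For example:
from wcwidth import wcswidth   # pip install wcwidth

print(wcswidth("abc"))       # 3
print(wcswidth("\u30CA"))    # 2 -- KATAKANA LETTER NA is East Asian wide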
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
Might be a little long for a title :)