- Number of UTF-8 code units (17 in this case)
- Number of UTF-16 code units (7 in this case)
- Number of UTF-32 code units or Unicode scalar values (5 in this case)
- Number of extended grapheme clusters (1 in this case)
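For anyone who wants to check the first three counts, here is a rough Python sketch. The string is an assumption on my part: a five-scalar emoji sequence (U+1F926 U+1F3FC U+200D U+2642 U+FE0F) chosen because it matches the numbers quoted above, written with escapes so it survives copy-paste. Grapheme cluster counting is left out because the Python standard library has no segmentation API, so that one would need a third-party package.

```python
# Assumed example string (not taken from the thread): five Unicode scalar
# values forming a single emoji, written as escapes.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# UTF-8 code units are bytes, so the encoded length is the count.
utf8_units = len(s.encode("utf-8"))

# UTF-16 code units are 2 bytes each; the "-le" codec emits no BOM.
utf16_units = len(s.encode("utf-16-le")) // 2

# UTF-32 code units are 4 bytes each, one per scalar value.
utf32_units = len(s.encode("utf-32-le")) // 4

# len() on a Python str counts code points.
code_points = len(s)

print(utf8_units, utf16_units, utf32_units, code_points)  # prints "17 7 5 5"
```

Same string, three different "lengths", and none of them is the 1 a user would see on screen.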
We would not have this problem if we all agreed to return the number of bytes instead.
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won.
I don't understand. It depends on the encoding, doesn't it?
Only if you are using a new enough version of Unicode. If you were using an older version, the count is more than 1. As new Unicode updates come out, the number of grapheme clusters a string has can change.
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
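To make the "which x and y are valid" question concrete for UTF-16-indexed strings, here is a small Python sketch. `utf16_boundaries` is a hypothetical helper name, not an API from any of the languages mentioned; it lists the UTF-16 offsets at which a code point starts, plus the end offset. Any other offset falls in the middle of a surrogate pair. Note that it says nothing about grapheme cluster boundaries, which are a stricter condition.

```python
def utf16_boundaries(s: str) -> list[int]:
    """Return UTF-16 code unit offsets that sit on a code point boundary.

    Hypothetical helper for illustration: the last entry is the total
    length in UTF-16 code units.
    """
    offsets = [0]
    for ch in s:
        # Code points above U+FFFF need a surrogate pair (2 code units).
        width = 2 if ord(ch) > 0xFFFF else 1
        offsets.append(offsets[-1] + width)
    return offsets


# Assumed example: "a", then U+1F926 (outside the BMP), then "b".
boundaries = utf16_boundaries("a\U0001F926b")
print(boundaries)  # prints "[0, 1, 3, 4]"
```

In this example offset 2 is the only one that would split a surrogate pair, so a substring starting or ending there would produce a lone surrogate.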
TXR Lisp:
1> (len " ")
5
2> (coded-length " ")
17
(Trust me when I say that the emoji was there when I edited the comment.)

The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
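The "go through the code points and add up their UTF-8 lengths" step looks roughly like this in Python. This is only a sketch of the idea, not TXR's actual implementation; `utf8_len` and `coded_length` are names I made up for the illustration (the second one just mirrors the TXR function name).

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 uses for one code point (hypothetical helper)."""
    if cp < 0x80:
        return 1  # ASCII range
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes, up to U+10FFFF


def coded_length(s: str) -> int:
    """Sum the per-code-point UTF-8 lengths; O(n) in the number of code points."""
    return sum(utf8_len(ord(ch)) for ch in s)


# Assumed example string: a five-code-point emoji sequence, written as escapes.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
print(len(s), coded_length(s))  # prints "5 17"
```

The linear walk is why the second number costs more than the first when the string is stored as an array of code points and the result is not cached.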
bstsb•56m ago
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
eastbound•29m ago
You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.