- Number of UTF-8 code units (17 in this case)
- Number of UTF-16 code units (7 in this case)
- Number of UTF-32 code units or Unicode scalar values (5 in this case)
- Number of extended grapheme clusters (1 in this case)
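For concreteness, a minimal Python sketch that reproduces those four counts for the facepalm emoji from the article (the grapheme-cluster count is not in the standard library; the third-party `grapheme` package is one option):

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # facepalm + skin tone + ZWJ + male sign + VS-16
    print(len(s.encode("utf-8")))            # 17 UTF-8 code units (bytes)
    print(len(s.encode("utf-16-le")) // 2)   # 7 UTF-16 code units
    print(len(s))                            # 5 code points / Unicode scalar values
    # Extended grapheme clusters: no stdlib API; e.g. the third-party
    # `grapheme` package reports grapheme.length(s) == 1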
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we had all decided to report the number of bytes the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
I don't understand. It depends on the encoding, doesn't it?
Only if you are using a new enough version of unicode. If you were using an older version it is more than 1. As new unicode updates come out, the number of grapheme clusters a string has can change.
But that isn't the same across all languages, or even across all implementations of the same language.
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
My understanding of the current "always and only UTF-8/Unicode" zeitgeist is that it comes mostly from encoding issues, chief among them the complexity of detecting encodings.
I think that the current status quo is better than what came before, but I also think it could be improved.
The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. That's most of them, but it seems like the worst of both worlds.
Non-normalized Unicode is just as problematic as non-validated Unicode, IMO.
But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...
Show me.
This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array.
It almost works as-is in my testing. (By the way, there's a typo in the usage message.) Here is my test process:
#!/usr/bin/env python
import random, sys, time

def out(b):
    # ASCII 0..7 for the second digit of the color code in the escape sequence
    color = random.randint(48, 55)
    sys.stdout.buffer.write(bytes([27, 91, 51, color, 109, b]))
    sys.stdout.flush()

for i in range(32, 256):
    out(i)
    time.sleep(random.random()/5)

while True:
    out(random.randint(32, 255))
    time.sleep(0.1)
I suppressed random output of C0 control characters to avoid messing up my terminal, but I added a test that basic ANSI escape sequences can work through this. (My initial version of this didn't flush the output, which mistakenly led me to try a bunch of unnecessary things in the main script.)
After fixing the `print` calls, the only thing I was forced to change (although I would do the code differently overall) is the output step:
# sys.stdout.write(out.encode("UTF-8"))
sys.stdout.buffer.write(out.encode("UTF-8"))
sys.stdout.flush()
I've tried this out locally (in gnome-terminal) with no issue. (I also compared to the original; I have a local build of 2.7 and adjusted the shebang appropriately.)

There's a warning that `bufsize=1` no longer actually means a byte buffer of size 1 for reading (instead it's magically interpreted as a request for line buffering), but this didn't cause a failure when I tried it. (And setting the size to e.g. `2` didn't break things, either.)
I also tried having my test process read from standard input; the handling of ctrl-C and ctrl-D seems to be a bit different (and in general, setting up a Python process to read unbuffered bytes from stdin isn't the most fun thing), but I generally couldn't find any issues here, either. Which is to say, the problems there are in the test process, not in `ibmfilter`. The input is still forwarded to, and readable from, the test process via the `Popen` object. And any problems of this sort are definitely still fixable, as demonstrated by the fact that `curses` is still in the standard library.
Of course, keys in the `special` mapping need to be defined as bytes literals now. Although that could trivially be adapted if you insist.
As for the typo, yep. But then, I've left this script essentially untouched for a couple of decades since I was given it.
Here's a diff:
diff --git a/ibmfilter b/ibmfilter
index 245d32c..2633335 100755
--- a/ibmfilter
+++ b/ibmfilter
@@ -1,6 +1,5 @@
-#!/usr/bin/python2 -tt
-# vim:set fileencoding=utf-8
-
+#!/usr/bin/python3
+
 from subprocess import *
 import sys
 import os, select
@@ -10,8 +9,8 @@ special = {
 }
 if len(sys.argv) < 2:
-    print "usage: ibmfilter [command]"
-    print "Runs command in a subshell and translates its output from ibm473 codepage to UTF-8."
+    print("usage: ibmfilter [command]")
+    print("Runs command in a subshell and translates its output from ibm473 codepage to UTF-8.")
     sys.exit(0)
 handle = Popen(sys.argv[1:], stdout=PIPE, bufsize=1)
@@ -26,8 +25,10 @@ while buf != '':
         os.kill(handle.pid)
         os.system('reset')
         raise Exception("Timed out while waiting for stdout to be writeable...")
-    sys.stdout.write(out.encode("UTF-8"))
-
+    sys.stdout.buffer.write(out.encode("UTF-8"))
+    sys.stdout.flush()
+
     buf = handle.stdout.read(1)
 handle.wait()
I have already tested it and it works fine, as far as I can tell, on every version from at least 3.3 through 3.13 inclusive. There's really nothing version-specific here, except the warning I mentioned, which was introduced in 3.8. If you encounter a problem, some more sophisticated diagnostics would be needed, and honestly I'm not actually sure where to start with that. (Although I'm mildly impressed that you still have access to a 2.7 interpreter in /usr/bin without breaking anything else.)

If you want to add overrides, you must use bytes literals for the keys. That looks like:
b'\xff': 'X'
> (heck, pip even warns you not to try installing libs globally so everyone can use same set these days)

Some Python programs have mutually incompatible dependencies, and you can't really have two versions of the same dependency loaded in the same runtime. This has always been a problem; you're just looking at the current iteration of pip trying to cooperate with Linux distros to help you not break your system as a result.
"Using the same set" is not actually desirable for development.
And with that out of the way. This one seems to mostly work!
So python3 did not significantly change the handling of this sort of byte stream, and while the Mercurial folks might well have had their own woes, I have no idea what the issues were in all those prior attempts with this file.
... that said, it does do one odd thing (following is output on launching):
/usr/lib/python3.12/subprocess.py:1016: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stdout = io.open(c2pread, 'rb', bufsize)
And yet, I can't spot any issues in gameplay caused by this so far, so I'm inclined to let it pass? But it does make me wonder if I might hit issues later on...

At least for now, I'm going to tentatively say it seems fine. Hm. You know what, let me try some more obvious things that might fail if the buffer size is wrong.
So. Now I'm wondering: given how relatively minor this change is (aside from the odd error message and the typical python3 changes, just one slightly modified line and one inserted line), why did so many pythonistas have so much difficulty over the many years I asked about this? I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling due to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning? Well, whatever. Clearly it's (mostly) fine now. And my carefully tweaked nethack profile is safe if python2 is removed, without needing to make my own stream filter. Yay! Thanks!
... further updates: OK, there are a few issues.
1) the warning
2) there's an odd ghost flicker that jumps around the nethack level as if a cursor is appearing - does not happen in the python2 one.
3) on quitting, it no longer exits gracefully and I have to ctrl-c the script.
4) It is much slower to render. The python2 one draws a screen almost instantly for most uses (although still a bit slower than not filtered, at least on this computer, for things that change a lot, like video). This one ripples down - that might explain the ghost flickering in ② and might be related to the buffer warning. This becomes much more noticeable with BBSes, although it is usually fine in nethack. You can see the difference on a simpler testcase, without setting up a BBS account, by streaming a bit more data at once, say by running: ibmfilter curl ascii.live/nyan
So, clearly not perfect but.. eh. functional? Still far better than prior attempts, and at least it mostly works with nethack.
Yes, that would be exactly why. You can use e.g. `sed` to remove leading whitespace from each line (I used it to add the leading whitespace for posting).
> ... that said, it does do one odd thing (following is output on launching):
Yes, that's the warning I mentioned. The original code requests to use a buffer size of 1, which is no longer supported (it now means to use line buffering).
> It is much slower to render.
Avoiding line buffering (by requesting a buffer size of 2 or more) might fix that. Actually, it might be a good idea to use a significantly larger buffer, so that e.g. an entire ANSI colour code can be read all at once.
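A minimal sketch of that suggestion, assuming the `Popen` call from the diff above (the value 4096 is just an illustrative guess):

    from subprocess import Popen, PIPE
    import sys

    # Any bufsize >= 2 avoids the line-buffering reinterpretation; a larger
    # buffer lets whole escape sequences arrive in fewer underlying reads.
    handle = Popen(sys.argv[1:], stdout=PIPE, bufsize=4096)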
The other issues are, I'm pretty sure, because of other things that changed in how `subprocess` works. Fixing things at this level would indeed require quite a bit more hacking around with the low-level terminal APIs.
> I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning?
Most likely, other attempts either a) didn't understand what the original code was doing in precise enough detail, or b) didn't know how to send binary data to standard output properly (Python 3 defaults to opening standard output as a text stream).
All of that said: I think that nowadays you should just be able to get a build of NetHack that just outputs UTF-8 characters directly; failing that, you can use the `locale` command to tell your terminal to expect cp437 data.
The unfortunate thing is the "lag" is a bit annoying with some apps, so I'll probably still use the python2 one for now.
It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.
It would be pretty silly for them to explode all strings to 4-byte characters.
They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.
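This is easy to observe from CPython itself (PEP 393 behaviour; the exact byte counts are implementation details that vary by version and platform):

    import sys

    # CPython picks the narrowest storage that fits the widest code point.
    for s in ["abcd",             # ASCII            -> 1 byte per code point
              "abc\u00e9",        # Latin-1 range    -> 1 byte per code point
              "abc\u20ac",        # BMP (euro sign)  -> 2 bytes per code point
              "abc\U0001F926"]:   # outside the BMP  -> 4 bytes per code point
        print(len(s), sys.getsizeof(s))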
I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.
Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.
Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?
It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.
You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't submerged in the language from an early age.
Imagine learning programming using only your high school Spanish. Good luck.
This + translated materials + locally written books is how STEM fields work in East Asia; the odds of success shouldn't be low. There just needs to be enough population using your language.
And frequently, there is no other name. There are a lot of diseases, and no language has names for all of them.
Identifiers in code are not a limited vocabulary, and understanding the structure of your code is important, especially so when you are in the early stages of learning.
Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.
Maybe I'm tired, but I've read this multiple times and can't quite figure out your desired position.
I *think* you are in favor of non -ASCII identifiers?
Like I said, I must be tired.
No, it is actually for security reasons. Once you allow non-ASCII identifiers, identifiers become non-identifiable. Only Zig recognized that. Nim allows insecure identifiers. https://github.com/rurban/libu8ident/blob/master/doc/c11.md#...
The motte: non-ASCII identifiers should be allowed
The bailey: disallowing non-ASCII identifiers is American imperialism at its worst
UNICODE is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (i.e. for the first 128 UNICODE code points, a UTF-8 encoded string is byte-by-byte compatible with the same string encoded in ASCII).
Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") - or any of the legacy 'national' multibyte encodings (Shift-JIS etc...) because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, but some flavour of Extended ASCII decoded with the wrong codepage).
In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything.
I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it.
Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed.
I was with you until this sentence. UTF-8 everywhere is great exactly because it is ASCII-compatible (i.e. all ASCII strings are automatically also valid UTF-8 strings, so UTF-8 is a natural upgrade path from ASCII) - both are just encodings for the same UNICODE codepoints, ASCII just cannot go beyond the first 128 codepoints, but that's where UTF-8 comes in, and in a way that's backward compatible with ASCII - which is the one ingenious feature of the UTF-8 encoding.
And bytes can conveniently fit both ASCII and UTF-8.
If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much.
But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences graphically show the appropriate character "�", otherwise let it pass through unmodified. Just don't crash, show useless error messages or in the worst case try to "fix" it by mangling the data even more.
This string cannot be encoded as ASCII in the first place.
> But if you allow full 8-bit bytes, please don't restrict them to UTF-8
UTF-8 has no 8-bit restrictions... You can encode any 21-bit UNICODE codepoint with UTF-8.
It sounds like you're confusing ASCII, Extended ASCII and UTF-8:
- ASCII: 7-bits per "character" (e.g. not able to encode international characters like äöü) but maps to the lower 7-bits of the 21-bits of UNICODE codepoints (e.g. all ASCII character codes are also valid UNICODE code points)
- Extended ASCII: 8-bits per "character" but the interpretation of the upper 128 values depends on a country-specific codepage (i.e. the interpretation of a byte value in the range between 128 and 255 differs between countries, and this is what causes all the mess that's usually associated with "ASCII". But ASCII did nothing wrong - the problem is Extended ASCII - this allows one to 'encode' äöü with the German codepage but then shows different characters when displayed with a non-German codepage)
- UTF-8: a variable-width encoding for the full range of UNICODE codepoints, uses 1..4 bytes to encode one 21-bit UNICODE codepoint, and the 1-byte encodings are identical with 7-bit ASCII (e.g. when the MSB of a byte in an UTF-8 string is not set, you can be sure that it is a character/codepoint in the ASCII range).
Out of those three, only Extended ASCII with codepages are 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine since any valid ASCII encoded string is indistinguishable from that same string encoded as UTF-8, e.g. ASCII has been 'retconned' into UTF-8.
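To make the codepage failure mode concrete, here is a small Python illustration (the specific codepage cp437 is just an example):

    s = "äöü"
    utf8 = s.encode("utf-8")            # b'\xc3\xa4\xc3\xb6\xc3\xbc'
    print(utf8.decode("cp437"))         # mojibake: the same bytes read through a codepage
    print(utf8.decode("utf-8"))         # 'äöü' again
    print("abc".encode("ascii") == "abc".encode("utf-8"))   # True: ASCII survives unchanged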
The problem they're describing happens because file names (in Linux and Windows) are not text: in Linux (so Android) they're arbitrary sequences of bytes, and in Windows they're arbitrary sequences of UTF-16 code points not necessarily forming valid scalar values (for example, surrogates can be present alone).
And yet, a lot of programs ignore that and insist on storing file names as Unicode strings, which mostly works (because users almost always name files by inputting text) until somehow a file gets written as a sequence of bytes that doesn't map to a valid string (i.e., it's not UTF-8 or UTF-16, depending on the system).
So what's probably happening in GP's case is that they managed somehow to get a file with a non-UTF-8-byte-sequence name in Android, and subsequently every App that tries to deal with that file uses an API that converts the file name to a string containing U+FFFD ("replacement character") when the invalid UTF-8 byte is found. So when GP tries to delete the file, the App will try to delete the file name with the U+FFFD character, which will fail because that file doesn't exist.
GP is saying that showing the U+FFFD character is fine, but the App should understand that the actual file name is not UTF-8 and behave accordingly (i.e. use the original sequence-of-bytes filename when trying to delete it).
Note that this is harder than it should be. For example, with the old Java API (from java.io[1]) that's impossible: if you get a `File` object from listing a directory and ask if it exists, you'll get `false` for GP's file, because the `File` object internally stores the file name as a Java string. To get the correct result, you have to use the new API (from java.nio.file[2]) using `Path` objects.
[1] https://developer.android.com/reference/java/io/File
[2] https://developer.android.com/reference/java/nio/file/Path
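The same round-trip issue can be sketched in Python on a POSIX system (hypothetical file name; Python decodes OS file names with the 'surrogateescape' error handler rather than U+FFFD, so the original bytes are recoverable):

    import os, tempfile

    d = tempfile.mkdtemp()
    # Create a file whose name is Latin-1-encoded, i.e. not valid UTF-8.
    open(os.path.join(os.fsencode(d), b"caf\xe9.txt"), "wb").close()
    print(os.listdir(os.fsencode(d)))   # [b'caf\xe9.txt']   - the exact on-disk bytes
    print(os.listdir(d))                # ['caf\udce9.txt']  - a lone surrogate, not U+FFFD
    name = os.listdir(d)[0]
    os.remove(os.path.join(d, name))    # works: the str round-trips to the original bytes
    os.rmdir(d)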
Sure, it's backward compatible, as in ASCII-handling code works on systems with UTF-8 locales, but how important is that?
It's only Windows which is stuck in the past here, and Microsoft had 3 decades to fix that problem and migrate away from codepages to locale-agnostic UTF-8 (UTF-8 was invented in 1992).
None of the signals were intuitive because they weren’t the typical English abbreviations!
Restricting the program part to ASCII is fine for me, but as a fellow German it's also important to recognize that we don't lose much by not having ä cömplete sät of letters. Everyone can write comprehensible German using ASCII characters only. So I would listen to what people from languages that really don't fit into ASCII have to say.
More relevantly though, good things can come from people who also did bad things; this isn't to justify doing bad things in hopes something good also happens, but it doesn't mean we need to ideologically purge good things based on their creators.
ASCII wasn't "imperialism," it was pragmatism. Yes, it privileged English -- but that's because the engineers designing it _spoke_ English and the US was funding + exporting most of the early computer and networking gear. The US Military essentially gave the world TCP/IP (via DARPA) for free!
Maybe "cultural dominance", but "imperialism at its worst" is a ridiculous take.
Now list anything as important from your list of downsides that's just as unfixable
That's a tradeoff you should carefully consider because there are also downsides to disallowing non-ASCII characters. The downsides of allowing non-ASCII mostly stem from assigning semantic significance to upper/lowercase (which is itself a tradeoff you should consider when designing a language). The other issue I can think of is homographs but it seems to be more of a theoretical concern than a problem you'd run into in practice.
When I first learned programming I used my native language (Finnish, which uses 3 non-ASCII letters: åäö) not only for strings and comments but also identifiers. Back then UTF-8 was not yet universally adopted (ISO 8859-1 character set was still relatively common) so I occasionally encountered issues that I had no means to understand at the time. As programming is being taught to younger and younger audiences it's not reasonable to expect kids from (insert your favorite non-English speaking country) to know enough English to use it for naming. Naming and, to an extent, thinking in English requires a vocabulary orders of magnitude larger than knowing the keywords.
By restricting source code to ASCII only you also lose the ability to use domain-specific notation like mathematical symbols/operators and Greek letters. For example in Julia you may use some mathematical operators (eg. ÷ for Euclidean division, ⊻ for exclusive or, ∈/∉/∋ for checking set membership) and I find it really makes code more pleasant to read.
Not saying the trade-off isn't worth it, but I do feel like there is a tendency to overuse unicode somewhat in Julia.
Just never ever use Extended ASCII (8-bits with codepages).
In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:
String.len() == number of bytes
String.bytes().count() == number of bytes
String.chars().count() == number of unicode scalar values
String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
String.lines().count() == number of lines
Really my only complaint is I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators, e.g. String.graphemes().count().
That's a real nice API. (Similarly, Python has @ for matmul but there is no implementation of matmul in the stdlib. NumPy has a matmul implementation so that the `@` operator works.)

ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.
It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?
String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(UTF-8).length

Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.
Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US. Where a lot of people with 's in their names live... Or double-barrelled names.
It's a much simpler problem and it still tripped up a lot of people.
And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold...
Even plain English text can't be represented with plain ASCII (although ISO-8859-1 goes a long way).
There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial).
The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
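A short Python illustration of that distinction (NFC is chosen arbitrarily here; NFD works equally well for canonical equivalence):

    import unicodedata

    a = "\u00e9"      # é as a single precomposed code point
    b = "e\u0301"     # e followed by a combining acute accent
    print(a == b)     # False: code-point-wise (effectively bitwise) comparison
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))   # True: canonically equivalent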
I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly
UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.
(Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.)
P.S. Everything about the response to this comment is wrong, especially the absurd baseless claim that I misunderstood the claim that I quoted and corrected (that's the only claim I responded to).
My comment explains that you have misunderstood what the claim is. "Byte code format" was nonsensical (Unicode is not interpreted by a VM), but the point that comment was trying to make (as I understood it) is that not all subsequences of a valid sequence of (assigned) code points are valid.
> Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.
My definition does not contradict that. A code point is an integer in the Unicode code space which may correspond to a character. When it does, "character" trivially means the thing that the code point corresponds to, i.e., represents, as I said.
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.
- letter
- word
- 5 :P
Never thought of it, but maybe there are rules that allow visually presenting the code point for ß as ss? At least (from experience as a user) there seems to be a single "ss" codepoint.
From a user experience perspective though it might be beneficial to pretend that "ß" == "ss" holds when parsing user input.
I never said it was ambiguous, I said it depends on the unicode version and the font you are using. How is that wrong? (Seems like the capital of ß is still SS in the latest unicode but since ẞ is the preferred capital version now this should change in the future)
I don't know how or if systems deal with this, but ß should be printed as ss if ß is unavailable in the font. It's possible this is completely up to the user.
[1] https://unicode.org/faq/casemap_charprop.html [2] https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk...
Where does the source corroborate that claim? Can you give us a hint where to find the source?
While in older versions [1] it was the other way around:
> E3: Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch die Verwendung des Großbuchstabens ẞ möglich. Beispiel: Straße – STRASSE – STRAẞE. (Translation: When writing in capital letters, one writes SS. The use of the capital letter ẞ is also possible. Example: Straße – STRASSE – STRAẞE.)
[1] https://www.rechtschreibrat.com/DOX/rfdr_Regeln_2016_redigie...
It's not. Uppercase of ß has always been SS.
Before we had a separate codepoint in Unicode this caused problems with round-tripping between upper and lower case. So Unicode rightfully introduced a separate codepoint specifically for that use case in 2008.
This inspired designers to design a glyph for that codepoint looking similar to ß. Nothing wrong with that.
Some liked the idea and it got some foothold, so in 2017, the Council for German Orthography allowed it as an acceptable variant.
Maybe it will win, maybe not, but for now in standard German the uppercase of ß is still SS and Unicode rightfully reflects that.
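For what it's worth, Python's case mappings, which follow the Unicode character data, reflect exactly this state of affairs:

    print("ß".upper())                       # 'SS'      - the default uppercase mapping
    print("ẞ".lower())                       # 'ß'       - the 2008 capital maps back down
    print("STRASSE".lower())                 # 'strasse' - the mapping is not reversible
    print("straße".casefold() == "strasse")  # True      - casefolding treats ß as ss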
Thanks, that is interesting!
> In the case of Wordle, you know the exact set of letters you’re going to be using
This holds for the generator side too. In fact, you have a fixed word list, and the fixed alphabet tells you what a "letter" is, and thus how to compute length. Because this concerns natural language, this will coincide with grapheme clusters, and with English Wordle, that will in turn correspond to byte length because it won't give you words with é (I think). In different languages the grapheme clusters might be larger than 1 byte (e.g. [1], where they're codepoints).
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.
Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.
Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.
This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
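To illustrate with Python (the string here is a decomposed "é"; both kinds of mid-sequence insertion are wrong, but only the byte-level one fails loudly):

    s = "e\u0301"                 # 'é' as e + combining acute accent
    b = s.encode("utf-8")         # b'e\xcc\x81'

    broken = b[:2] + b"a" + b[2:]     # byte offset 2 is inside the accent's encoding
    try:
        broken.decode("utf-8")
    except UnicodeDecodeError as e:
        print("caught:", e)           # fails loudly

    weird = s[:1] + "a" + s[1:]       # code point offset 1 splits the same cluster
    print(weird)                      # 'ea\u0301' - valid, silently renders as 'eá'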
> This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.
Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.
I vehemently dissent from this view.
Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).
Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.
Ironic.
> The advice wasn’t to ignore learning about code points
I didn't say "learning about."
Look man. People operate at different levels of abstraction, depending on what they're doing.
If you're doing front-end web dev, sure, don't worry about it. If you're hacking on a text editor in C, then you probably ought to be able to take a string of UTF-8 bytes, decode them into code points, and apply the grapheme clustering algorithm to them, taking into account your heuristics about what the terminal supports. And then probably either printing them to the screen (if it seems like they're supported) or printing out a representation of the code points. So yeah, you kind of have to know.
So don't sit there and presume to tell others what they should or should not reason about, based solely on what you assume their use case is.
No, it's telling people that they don't understand how data works; otherwise they'd be using a different unit of measurement.
Nobody is saying that, the point is that if you're parsing Unicode by counting codepoints you're doing it wrong. The way you actually parse Unicode text (in 99% of cases) is by iterating through the codepoints, and then the actual count is fairly irrelevant, it's just a stream.
Other uses of codepoint length are also questionable: for measurement it's useless, for bounds checking (random access) it's inefficient. It may be useful in some edge cases, but TFA's point is that a general purpose language's default string type shouldn't optimize for edge cases.
size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), at least if you're dealing with anything beyond ASCII (which you really should be, since East Asian users alone number in the billions).
Same goes to the "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
And before that, the only thing the relative rarity bought you was that bugs in code working on UTF-8 bytes got fixed, while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
The metrics you care about are likely number of letters from a human perspective (1) or the number of bytes of storage (depends), possibly both.
[1]: https://tomsmeding.com/unicode#U+65%20U+308 [2]: https://tomsmeding.com/unicode#U+EB
In an environment that supports advanced Unicode features, what exactly do you do with the string length?
I want to make sure that the password is between a given minimum and maximum number of characters. Same with phone numbers, email addresses, etc.
This seems to have always been known as the length of the string.
This thread sounds like a bunch of scientists trying to make a simple concept a lot harder to understand.
> This seems to have always been known as the length of the string.
Sure. And by this definition, the string discussed in TFA (that consists of a facepalm emoji with a skin tone set) objectively has 5 characters in it, and therefore a length of 5. And it has always had 5 characters in it, since it was first possible to create such a string.
Similarly, "é" has one character in it, but "é" has two despite appearing visually identical. Furthermore, those two strings will not compare equal in any sane programming language without explicit normalization (unless HN's software has normalized them already). If you allow passwords or email addresses to contain things like this, then you have to reckon with that brute fact.
None of this is new. These things have fundamentally been true since the introduction of Unicode in 1991.
do you mean "byte"? or "rune"?
If you do allow Unicode characters in whatever it is you're validating, then your approach is almost certainly wrong for some valid input.
For exact lengths, you often have a restricted character set (like for phone numbers) and can validate both characters and length with a regex. Or the length in bytes works for 0–9.
Unless you're involved in text layout, you actually usually don't wind up needing the exact length in characters of arbitrary UTF-8 text.
When I'm comparing human-readable strings I want the length. In all other cases I want sizeof(string), and it's... quite a variable thing.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
To predict the pixel width of a given text, right?
One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.
I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.
* I'm talking about the DOM route, not <canvas> obviously. VS Code is powered by Monaco, which is DOM-based, not canvas-based. You can "Developer: Toggle Developer Tools" to see the DOM structure under the hood.
** I should further qualify my statement as browsers are fundamentally incapable of this if you use native text node rendering. I have built a perfectly monospace mixed CJK and Latin interface myself by wrapping each full width character in a separate span. Not exactly a performance-oriented solution. Also IIRC Safari doesn’t handle lengths in fractional pixels very well.
Seemed awkward, but I eventually realized I rarely cared about the number of characters. Even when dealing with substrings, I really only cared about a means to describe “stuff” before/after, not literal indices.
Counting Unicode characters is actually a disservice.
TXR Lisp:
1> (len "🤦🏼‍♂️")
5
2> (coded-length "🤦🏼‍♂️")
17
(Trust me when I say that the emoji was there when I edited the comment.) The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
UTF-8 is so complicated because it wants to be backwards compatible with ASCII.
But thank you for the link, it's turning out to be a very enjoyable read! There already seems to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.
Python, of all languages, probably has the best property-based testing library out there with "hypothesis". I sometimes even use it to drive tests for my Haskell and OCaml and Rust code. The author of Hypothesis wrote a few nice articles about why his approach is better (and I agree); however, I can't find them at the moment...
Even ascii used to use "overstriking" where the backspace character was treated as a joiner character to put accents above letters.
- requires less memory for most strings, particularly ones that are largely limited to ASCII, like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8, while UTF-16 might be either little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: if you jump to a random memory position you can find the next and previous UTF-8 characters (see the sketch below). This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
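A minimal sketch of that self-synchronization property in Python (the helper name is made up for illustration):

    def prev_boundary(buf: bytes, i: int) -> int:
        """Step back from byte index i to the start of the UTF-8 sequence it falls inside."""
        while i > 0 and (buf[i] & 0xC0) == 0x80:   # 0b10xxxxxx marks a continuation byte
            i -= 1
        return i

    data = "naïve".encode("utf-8")   # 'ï' occupies bytes 2..3 (0xC3 0xAF)
    print(prev_boundary(data, 3))    # 2: index 3 is a continuation byte, so back up to its lead byte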
So "no combinations" was never going to happen.
Especially when you start getting into non latin-based languages.
Unicode definitely has its faults, but on the whole it‘s great. I‘ll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.
E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.
So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
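If you just need the segmentation rather than implementing it yourself, one option in Python is the third-party `regex` module (not the stdlib `re`), whose \X pattern matches extended grapheme clusters using its bundled Unicode data:

    import regex  # pip install regex

    print(regex.findall(r"\X", "e\u0301x"))   # ['é', 'x'] - the combining accent stays attached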
bool utf_append_plaintext(utf* result, const char* text) {
#define msk(byte, mask, value) ((byte & mask) == value)
#define cnt(byte) msk(byte, 0xc0, 0x80)
#define shf(byte, mask, amount) ((byte & mask) << amount)
    utf_clear(result);
    if (text == NULL)
        return false;
    size_t siz = strlen(text);
    uint8_t* nxt = (uint8_t*)text;
    uint8_t* end = nxt + siz;
    if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
        nxt += 3;
    while (nxt < end) {
        bool aok = false;
        uint32_t cod = 0;
        uint8_t fir = nxt[0];
        if (msk(fir, 0x80, 0)) {
            cod = fir;
            nxt += 1;
            aok = true;
        } else if ((nxt + 1) < end) {
            uint8_t sec = nxt[1];
            if (msk(fir, 0xe0, 0xc0)) {
                if (cnt(sec)) {
                    cod |= shf(fir, 0x1f, 6);
                    cod |= shf(sec, 0x3f, 0);
                    nxt += 2;
                    aok = true;
                }
            } else if ((nxt + 2) < end) {
                uint8_t thi = nxt[2];
                if (msk(fir, 0xf0, 0xe0)) {
                    if (cnt(sec) && cnt(thi)) {
                        cod |= shf(fir, 0x0f, 12);
                        cod |= shf(sec, 0x3f, 6);
                        cod |= shf(thi, 0x3f, 0);
                        nxt += 3;
                        aok = true;
                    }
                } else if ((nxt + 3) < end) {
                    uint8_t fou = nxt[3];
                    if (msk(fir, 0xf8, 0xf0)) {
                        if (cnt(sec) && cnt(thi) && cnt(fou)) {
                            cod |= shf(fir, 0x07, 18);
                            cod |= shf(sec, 0x3f, 12);
                            cod |= shf(thi, 0x3f, 6);
                            cod |= shf(fou, 0x3f, 0);
                            nxt += 4;
                            aok = true;
                        }
                    }
                }
            }
        }
        if (aok)
            utf_push(result, cod);
        else
            return false;
    }
    return true;
#undef cnt
#undef msk
#undef shf
}
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...
It even includes an optimized fast path for ASCII, and it works at compile-time as well.
Why are the arguments not three-letter though? I would feel terrible if that was my code.
e.g., https://github.com/mayo-dayo/app/blob/0.4/src/middleware.ts
Just set your editor's line-height.
static UnicodeCodepoint utf8_decode(u8 const bytes[static 4], u8 *out_num_consumed) {
    u8 const flipped = ~bytes[0];
    if (flipped == 0) {
        // Because __builtin_clz is UB for value 0.
        // When this happens, the UTF-8 is malformed.
        *out_num_consumed = 1;
        return 0;
    }
    u8 const num_ones = __builtin_clz(flipped) & 0x07;
    u8 const num_bytes_total = num_ones > 1 ? num_ones : 1;
    u8 const main_byte_shift = num_ones + 1;
    UnicodeCodepoint value = bytes[0] & (0xFF >> main_byte_shift);
    for (u8 i = 1; i < num_bytes_total; ++i) {
        if (bytes[i] >> 6 != 2) {
            // Not a valid continuation byte.
            *out_num_consumed = i;
            return 0;
        }
        value = (value << 6) | (bytes[i] & 0x3F);
    }
    *out_num_consumed = num_bytes_total;
    return value;
}

> [...(new Intl.Segmenter()).segment(THAT_FACEPALM_EMOJI)].length
1
[^1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.
You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."
You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.
But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.
> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
> The problem will be the same if you have to reconstruct the grapheme clusters eventually.
In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.
> You don't want that if you e.g. have an index for fulltext search.
Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.
No it doesn't. It says it's "rather useless" that len(str) returns the number of code points, because there's rarely a reason to store the count of code points as the string length. By contrast, storing the number of native code units is useful for storage allocation and concatenation, which are common operations.
Code point indexing is still very useful, depending on context. For example, a majority of Korean speakers (~50 million Internet users) prefer deletion by Jaso unit. Korean EGCs are whole syllables, and making someone retype a whole syllable to change one character is bad UX.
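A quick Python illustration of why the grapheme cluster can be the wrong granularity here (Hangul syllables decompose canonically into their jamo):

    import unicodedata

    s = "\ud55c"                             # '한': one grapheme cluster, one code point
    jamo = unicodedata.normalize("NFD", s)   # decompose into the individual jamo
    print(list(jamo))                        # ['ᄒ', 'ᅡ', 'ᆫ']
    print(len(s), len(jamo))                 # 1 3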
• https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)
• https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)
• https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)
I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
$ raku
Welcome to Rakudo™ v2025.06.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2025.06.
[0] > " ".chars
1
[1] > " ".codes
5
[2] > " ".encode('UTF-8').bytes
17
[3] > " ".NFD.map(*.chr.uniname)
(FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...
But most programmers think in arrays of grapheme clusters, whether they know it or not.
Which, to humor the parent, is also true of raw bytes strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
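A minimal demonstration of that point (a lone surrogate is a perfectly storable str value but not valid Unicode, so UTF-8 encoding fails unless you opt into an escape hatch):

    s = "\ud83e"                    # half of a surrogate pair
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print("caught:", e)         # surrogates not allowed
    print(s.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\xbe'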
> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
If I write,
def foo(s: str) -> …:
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take def foo(s: UnicodeWithBullshit) -> …:

It's a common mistake. A lot of code was written using str despite users needing it to operate on UnicodeWithBullshit. PEP 383 was a necessary escape hatch to fix countless broken programs.
No, nothing about the "string" type in python implies unicode. It's, for all intents and purposes, its own encoding, and should be treated as such. Not all encodings it can convert to are representable as unicode, and vice versa, so it makes no sense to think of it as unicode.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
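For example, a tiny sketch of the indexing behaviour described above:

    b = "ä".encode("utf-8")      # b'\xc3\xa4'
    print(b[0])                  # 195 - indexing bytes yields an int
    print(b[:1])                 # b'\xc3' - slicing yields bytes
    print(b.decode("utf-8"))     # 'ä' - text only comes back via an explicit decode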
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
If you want to see a more interesting case than emoji, check out Thai language. In Thai, vowels could appear before, after, above, below, or on many sides of the associated consonants.
It’s not wrong that "🤦🏼‍♂️".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)
String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)
String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte, is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
The unit is perfectly meaningful.
It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)
Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.
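To illustrate the interop point with a sketch: an unassigned code point (U+40000 is merely an example that happens to be unassigned as of current Unicode versions) can be stored, counted, and encoded without complaint:
import unicodedata

c = chr(0x40000)                                  # storable even though unassigned
print(len(c))                                     # 1
print(unicodedata.name(c, "<no assigned name>"))  # <no assigned name>
print(c.encode("utf-8"))                          # b'\xf1\x80\x80\x80'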
I don't understand what you mean by "USV count".
> but what is a character?
It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.
> …but "5" or "7"? Where do those even come from?
From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.
> Again: "character in the implementation" is a meaningless concept.
"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
Python does not use UTF-32, even notionally. Yes, I know it uses a compact representation in memory when the value is ASCII, etc. That's not what I'm talking about here. |str| != |all UTF32 strings|; `str` and "UTF-32" are different things, as there are values in the former that are absent in the latter, and again, this is why encoding to utf8 or any utf encoding is fallible in Python.
Code point count is not a meaningful metric, though I suppose strictly speaking, yes, len() is code points.
> I don't understand what you mean by "USV count".
The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.) It's the basic building block of Unicode. It's only marginally useful, and there's a host of other more meaningful metrics, like memory size, terminal width, graphemes, etc. But it's more meaningful than code points, and if you want to do anything at any higher level of representation, USVs are going to be what you want to build off. Anything else is going to be more fraught with error, needlessly.
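There's no built-in for it, but as a sketch of what "USV count" means in Python terms (code points minus surrogates):
def usv_count(s: str) -> int:
    # Unicode scalar values = code points excluding surrogates (U+D800..U+DFFF)
    return sum(1 for ch in s if not 0xD800 <= ord(ch) <= 0xDFFF)

print(usv_count("\U0001F926\U0001F3FC\u200D\u2642\uFE0F"))  # 5
print(len("a\ud800b"), usv_count("a\ud800b"))               # 3 2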
> It's what the Unicode standard says a character is.
The Unicode definition of "character" is not a technical definition, it's just there to help humans. Again, if I fed that definition to a human, and asked the same question above, <facepalm…> is 1 "character", according to that definition in Unicode as evaluated by a reasonable person. That's not the definition Python uses, since it returns 5. No reasonable person is looking at the linked definition, and then at the example string, and answering "5".
"How many smallest components of written language that has semantic value does <facepalm emoji …> have?" Nobody is answering "5".
(And if you're going to quibble with my use of definition (1.), the same applies to (2.). (3.) doesn't apply here as Python strings are not Unicode strings (again, |str| != |all Unicode strings|), (4.) is specific to Chinese.)
> "Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
A lot of people writing bad code does not make bad code good. Ambiguous technical documentation is likewise not made good by being ambiguous. Any use of "character" in technical writing would be made more clear by replacing it with one of the actual technical terms defined by Unicode, whether that's "UTF-16 code point", "USV", "byte", etc. "Character" leaves far too much up to the imagination of the reader.
No, codepoints are, hence their name. Scalars are a subset of all codepoints. https://stackoverflow.com/questions/48465265/what-is-the-dif...
> whether that's "UTF-16 code point"
That's not a thing; you're thinking of UTF-16 code units rather, I believe.
Yes, yes, the `str` type may contain data that doesn't represent a valid string. I've already explained elsewhere ITT that this is a feature.
And sure, pedantically it should be "UCS-4" rather than UTF-32 in my post, since a str object can be created which contains surrogates. But Python does not use surrogate pairs in representing text. It only stores surrogates, which it considers invalid at encoding time.
Whenever a `str` represents a valid string without surrogates, it will reliably encode. And when bytes are decoded, surrogates are not produced except where explicitly requested for error handling.
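A quick demonstration of both halves of that (the byte string is just an arbitrary bit of invalid UTF-8):
s = "abc\ud800"        # a lone surrogate has to be constructed deliberately
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print(e)           # rejected at encoding time: "surrogates not allowed"

raw = b"caf\xe9"       # Latin-1 bytes, not valid UTF-8
print(ascii(raw.decode("utf-8", errors="surrogateescape")))  # 'caf\udce9' -- only when asked for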
> The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.)
Ah.
Good news: since Python doesn't use surrogate pairs to represent valid text, these are the same whenever the `str` contents represent a valid text string in Python. And the cases where they don't, are rare and more or less must be deliberately crafted. You don't even get them from malicious user input, if you process input in obvious ways.
> The Unicode definition of "character" is not a technical definition, it's just there to help humans.
You're missing the point. The facepalm emoji has 5 characters in it. The Unicode Consortium says so. And they are, indisputably, the ones who get to decide what a "character" is in the context of Unicode.
I linked to the glossary on unicode.org. I don't understand how it could get any more official than that.
Or do you know another word for "the thing that an assigned Unicode code point has been assigned to"? cf. also the definition of https://www.unicode.org/glossary/#encoded_character , and note that definition 2 for "character" is "synonym of abstract character".
I just relied on this fact yesterday, so it's kind of a funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was what Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was to iterate through all the codepoints in the file, tally it all up by Unicode block, and print the results. Something this design was perfectly suited for.
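Not the actual script, obviously, but a rough sketch of that kind of per-block tally; the block table below is a tiny made-up sample, and a real tool would load the full Blocks.txt from the Unicode Character Database:
import sys
from bisect import bisect_right

# Incomplete, illustrative block table: (first code point, block name)
BLOCKS = [
    (0x0000, "Basic Latin"),
    (0x0080, "Latin-1 Supplement"),
    (0x0370, "Greek and Coptic"),
    (0x0400, "Cyrillic"),
    (0x2000, "General Punctuation"),
    (0x1F300, "Miscellaneous Symbols and Pictographs"),
    (0x1F900, "Supplemental Symbols and Pictographs"),
]
STARTS = [start for start, _ in BLOCKS]

def block_of(cp: int) -> str:
    # Nearest listed block at or below cp (inexact in the gaps, since the table is partial)
    return BLOCKS[bisect_right(STARTS, cp) - 1][1]

def tally(path: str) -> dict:
    counts = {}
    with open(path, encoding="utf-8") as f:
        for ch in f.read():          # iterate code point by code point
            name = block_of(ord(ch))
            counts[name] = counts.get(name, 0) + 1
    return counts

if __name__ == "__main__":
    for name, n in sorted(tally(sys.argv[1]).items(), key=lambda kv: -kv[1]):
        print(f"{n:8d}  {name}")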
Now of course:
- it coming in handy once for my specific random workload doesn't mean it's good design
- my specific workload may not be rational (am a dingus sometimes)
- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome
- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.
But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting only scalars is what would be weird in my view; you'd potentially be "randomly" skipping over parts of the data.
Also good against data fingerprinting, homoglyph attacks in links (e.g. in comments), pranks (greek question mark vs. semicolon), or if it's a strictly international codebase, checking for anything outside ASCII. So when you don't really trust a codebase and want to establish a baseline, basically.
But I also included other features, like checking line ending consistency, line indentation consistency, line lengths, POSIX compliance, and encoding validity. Line lengths were of particular interest to me, having seen some malicious PRs recently to FOSS projects where the attacker would just move the payload out of sight to the side, expecting most people to have word wrap off and just not even notice (pretty funny tbf).
" ".codePoints().count()
==> 5
" ".chars().count()
==> 7
" ".getBytes(UTF_8).length
==> 17
(HN doesn't render the emoji in comments, it seems, hence the \u escapes above.)
At first there was an empty space between the double quotes. This made me click and read the article because it was surprising that the length of a space would be 7.
Then the actual emoji appeared and the title finally made sense.
Now I see escaped \u{…} characters spelled out and it’s just ridiculous.
Can’t wait to come back tomorrow to see what it will be then.
> So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
Thank you!
https://stackoverflow.com/questions/2241348/what-are-unicode...
Still have more reading to do and a lot to learn but this was super informative, so thank you internet stranger.
1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.
2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.
3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
Anything more is and should be beyond the purview of the simple built-in `len`:
4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a “character”.” Cf. https://www.unicode.org/reports/tr29
Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?
Anyway, the great module https://pypi.org/project/regex/ supports “Matching a single grapheme \X”. So:
len(regex.findall(r"\X", "\U0001F926\U0001F3FC\u200D\u2642\uFE0F")) == 1
5. The space a sequence of code points will occupy on the screen: certainly useful but at least dependent on the typeface that will be used for rendering and hence certainly beyond the purview of a simple function.
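If you do want a rough answer to that last one, the third-party wcwidth package (https://pypi.org/project/wcwidth/) estimates terminal cell widths; actual rendering still depends on the terminal and font. For example:
from wcwidth import wcswidth   # pip install wcwidth

print(wcswidth("abc"))       # 3
print(wcswidth("\u30CA"))    # 2 -- KATAKANA LETTER NA is East Asian wide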
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
Might be a little long for a title :)