Ask HN: What are your Unicode woes?

4•Rendello•15h ago
I've always worked with text, but I only started digging deep into understanding Unicode this year.

What do HN people have to say about Unicode and UTF-{8,16,32}? Are there parts you've never really understood? Have you had unexpected bugs due to misunderstood properties of text?

Comments

Rendello•15h ago
I (OP) have been working on some Unicode visualization tooling for a while now. The idea started when I had some buggy string-matching code. I was matching case-insensitively, then using those ranges to highlight the original text.

Turns out, sometimes changing case changes not only the number of bytes (in UTF-8), but the number of encoded characters! This led to my post "UTF-8 characters that behave oddly when the case is changed" [1], which inspired a lot of conversation that taught me a lot. After that, I started reading Unicode documentation in earnest, and building up an idea of what a new tool should show. I'm trying to make clear things I didn't (and sometimes still don't) understand, so I'd love to know what causes pains in the wild / gaps in people's understanding.

1. https://news.ycombinator.com/item?id=42014045
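A minimal Python sketch of the effect described above, assuming Python 3's built-in `str.upper()` and `str.lower()` (which apply Unicode's full case mappings). The helper name `case_change_report` and the sample string are made up for illustration; this is not the code from the original tool.

```python
def case_change_report(text: str) -> dict:
    """Compare code point and UTF-8 byte counts for text, its uppercase, and its lowercase."""
    upper, lower = text.upper(), text.lower()
    return {
        "code_points": (len(text), len(upper), len(lower)),
        "utf8_bytes": tuple(len(s.encode("utf-8")) for s in (text, upper, lower)),
    }

# U+00DF sharp s, U+0130 dotted capital I, U+FB01 fi ligature, U+212A Kelvin sign
for sample in ["\u00df", "\u0130", "\ufb01", "\u212a"]:
    print(f"U+{ord(sample):04X}", case_change_report(sample))

# The highlighting bug: offsets found in the lowercased text do not
# necessarily line up with the original text.
original = "\u0130stanbul"           # "İstanbul"
lowered = original.lower()           # "i" + U+0307 + "stanbul", one code point longer
start = lowered.find("stanbul")      # 2 in the lowered string
print(repr(original[start:]))        # 'tanbul': the match is shifted by one
```

One way around this is to search with a method that reports offsets in the original string, rather than reusing offsets computed on a case-mapped copy.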

solardev•5h ago
I don't understand the difference between a character, a code point, a glyph, and whatever else makes up a single "thing" in Unicode.
Rendello•3h ago
That tripped me up too. The Unicode Core spec is quite good at explaining things and introduces some terminology you don't really hear outside the document. Chapter 2, General Structure, is worth reading in its entirety. I've linked some bits that might help:

> *2.2.3 Characters, Not Glyphs*

> The Unicode Standard draws a distinction between characters and glyphs. Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. [...] Letters in different scripts, even when they correspond either semantically or graphically, are represented in Unicode by distinct characters.

> Characters are represented by code points that reside only in a memory representation, as strings in memory, on disk, or in data transmission. The Unicode Standard deals only with character codes.

> *2.4 Code Points and Characters*

> The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.

> *2.5 Encoding Forms*

This deals with UTF-{8,16,32}, which is a tricky bit and tripped me up for a long time. If the document is too dense here, there's a lot of supplementary material online explaining the different forms. I'll link a Tom Scott video explaining UTF-8.

---

The long and short of it is: the atomic unit of Unicode is the character, or encoded character, which is an abstract character that has been assigned a code point, which is an integer usually written in hex as U+XXXX. Unicode doesn't deal with glyphs or graphical representations, just characters and their properties (e.g. what is the character name? what should this character do when uppercased?).

As you probably know, many characters can combine with others to form grapheme clusters, which may look like a single (abstract) character, but underneath consist of multiple (encoded) characters. Every character is associated with an integer index (a code point), and those integers can be represented in three encoding forms (this sort of happened by accident): UTF-32 (just represent the integer directly), UTF-16 (was originally supposed to represent the integer directly, but there were too many and it got extended), and UTF-8 (which uses different byte lengths to encode different characters efficiently).
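To make the code point / encoding form distinction concrete, here is a small Python sketch using only the standard library. The helper name `encoded_sizes` is hypothetical, and the little-endian codecs are chosen just so no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the code point of a single character and its size in bytes per encoding form."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": len(ch.encode("utf-8")),
        "utf16": len(ch.encode("utf-16-le")),   # "-le" avoids a byte order mark
        "utf32": len(ch.encode("utf-32-le")),
    }

# Latin letter, accented letter, euro sign, emoji outside the Basic Multilingual Plane
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(encoded_sizes(ch))

# A grapheme cluster that looks like one character but is two encoded characters:
cluster = "e\u0301"                  # "e" followed by U+0301 COMBINING ACUTE ACCENT
print(len(cluster))                  # 2 code points
```

Python's `len()` counts code points, not grapheme clusters; the standard library has no grapheme segmentation, so counting user-perceived characters needs a third-party package.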

[spec] https://www.unicode.org/versions/Unicode16.0.0/core-spec/

[2.2.3] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.4] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.5] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[Tom Scott UTF-8] https://www.youtube.com/watch?v=MijmeoH9LT4

NoahZuniga•3h ago
I guess it's kind of annoying that letters with diacritics can be represented in multiple different ways.
Rendello•2h ago
That's true, and even with normalization, there are four normalization forms for strings. The K forms (NFKC and NFKD) are mostly for searching, but that still leaves NFC and NFD.
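A short Python sketch of the four forms, assuming the standard library's `unicodedata.normalize`. The helper name `normalized_forms` is made up for illustration.

```python
import unicodedata

def normalized_forms(text: str) -> dict:
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

composed = "\u00e9"       # precomposed e with acute accent
decomposed = "e\u0301"    # "e" plus combining acute accent

print(composed == decomposed)                                  # False: different code points
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True

# The K (compatibility) forms also fold characters such as ligatures:
ligature = "\ufb01"       # LATIN SMALL LIGATURE FI
forms = normalized_forms(ligature)
print({name: [f"U+{ord(c):04X}" for c in value] for name, value in forms.items()})
```

Comparing strings after normalizing both sides to the same form (NFC or NFD) makes the two spellings of "é" compare equal, while the K forms additionally turn the ligature into plain "fi".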

The normalization forms are explained, in order of approachability (imo), in this random YouTube video, Unicode Standard Annex #15, and the Unicode Core Spec:

https://www.youtube.com/watch?v=ttLD4DiMpiQ

https://unicode.org/reports/tr15/

https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
