
Ask HN: What are your Unicode woes?

5•Rendello•7mo ago
I've always worked with text, but I only started digging deep into understanding Unicode this year.

What do HN people have to say about Unicode and UTF-{8,16,32}? Are there parts you've never really understood? Have you had unexpected bugs due to misunderstood properties of text?

Comments

Rendello•7mo ago
I (OP) have been working on some Unicode visualization tooling for a while now. The idea started when I had some buggy string-matching code. I was matching case-insensitively, then using those ranges to highlight the original text.

Turns out, sometimes changing case changes not only the number of bytes (in UTF-8) but also the number of encoded characters! This led to my post "UTF-8 characters that behave oddly when the case is changed" [1], which sparked a lot of conversation and taught me a great deal. After that, I started reading the Unicode documentation in earnest and building up an idea of what a new tool should show. I'm trying to make clear the things I didn't (and sometimes still don't) understand, so I'd love to know what causes pain in the wild and where the gaps in people's understanding are.

1. https://news.ycombinator.com/item?id=42014045
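A quick Python sketch of the case-change surprise (standard library only; the characters are just illustrative examples, not an exhaustive list):

```python
# Case mapping can change both the codepoint count and the UTF-8 byte count.
for ch in ["ß", "ﬁ", "ǰ"]:
    up = ch.upper()
    print(f"{ch!r} -> {up!r}: "
          f"{len(ch)} -> {len(up)} codepoints, "
          f"{len(ch.encode('utf-8'))} -> {len(up.encode('utf-8'))} UTF-8 bytes")
```

"ß" uppercases to "SS", the "ﬁ" ligature to "FI", and "ǰ" to "J" plus a combining caron, so ranges computed on case-folded text won't line up with the original.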

solardev•7mo ago
I don't understand the difference between a character, a codepoint, a glyph, and whatever else makes up a single "thing" in unicode.
Rendello•7mo ago
That tripped me up too. The Unicode Core spec is quite good at explaining things and introduces some terminology you don't really hear outside the document. Chapter 2, General Structure, is worth reading in its entirety. I've linked some bits that might help:

> *2.2.3 Characters, Not Glyphs*

> The Unicode Standard draws a distinction between characters and glyphs. Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. [...] Letters in different scripts, even when they correspond either semantically or graphically, are represented in Unicode by distinct characters.

> Characters are represented by code points that reside only in a memory representation, as strings in memory, on disk, or in data transmission. The Unicode Standard deals only with character codes.

> *2.4 Code Points and Characters*

> The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.

> *2.5 Encoding Forms*

This deals with UTF-{8,16,32}, which is a tricky bit that tripped me up for a long time. If the document is too dense here, there's a lot of supplementary material online explaining the different forms; I'll link a Tom Scott video explaining UTF-8.

---

The long and short of it is: the atomic unit of Unicode is the character, or encoded character, a value that has been associated with a code point, which is an integer usually written in hex form as U+XXXX. Unicode doesn't deal with glyphs or graphical representations, just characters and their properties (e.g. what is the character's name? what should it do when uppercased?). As you probably know, many characters can combine with others to form grapheme clusters, which may look like a single (abstract) character but underneath consist of multiple (encoded) characters. Every character is associated with an integer index (a code point), and those integers can be represented in three encoding forms (this sort of happened by accident): UTF-32 (just represent the integer directly), UTF-16 (originally meant to represent the integer directly, but there turned out to be too many characters, so it was extended with surrogate pairs), and UTF-8 (which uses different byte lengths to encode different characters efficiently).

[spec] https://www.unicode.org/versions/Unicode16.0.0/core-spec/

[2.2.3] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.4] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.5] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[Tom Scott UTF-8] https://www.youtube.com/watch?v=MijmeoH9LT4
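To make the three forms concrete, here's a standard-library Python sketch encoding the same code points each way:

```python
# The same code point in all three encoding forms.
s = "€"  # U+20AC EURO SIGN
print(s.encode("utf-32-be").hex())  # 000020ac: the integer, directly
print(s.encode("utf-16-be").hex())  # 20ac: fits in a single 16-bit unit
print(s.encode("utf-8").hex())      # e282ac: three bytes, variable-width

# Code points beyond U+FFFF need UTF-16's extension mechanism.
g = "𝄞"  # U+1D11E MUSICAL SYMBOL G CLEF
print(g.encode("utf-16-be").hex())  # d834dd1e: a surrogate pair
```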

NoahZuniga•7mo ago
I guess it's kind of annoying that letters with diacritics can be represented in multiple different ways
Rendello•7mo ago
That's true, and even with normalization, there are four normalization forms for strings. The K forms (NFKC and NFKD) are mostly for searching, but that still leaves NFC and NFD.

The normalization forms are explained, in order of approachability (IMO), in this random YouTube video, Unicode Annex #15, and the Unicode Core Spec:

https://www.youtube.com/watch?v=ttLD4DiMpiQ

https://unicode.org/reports/tr15/

https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
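A small Python sketch of the four forms, using a made-up input that contains both a compatibility character and a combining mark:

```python
import unicodedata

s = "\ufb01e\u0301"  # LATIN SMALL LIGATURE FI + "e" + COMBINING ACUTE ACCENT
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    print(form, [f"U+{ord(c):04X}" for c in out])
# NFC/NFD leave the ligature alone; the K forms expand it to "fi",
# which is why they're useful for searching but lossy for display.
```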

bjourne•7mo ago
Comparing strings by byte-for-byte equality is kinda dubious anyway.
Rendello•7mo ago
String comparison is a difficult problem. Consider:

Å (ANGSTROM SIGN)

Å (LATIN CAPITAL LETTER A WITH RING ABOVE)

Å (LATIN CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)

А̊ (CYRILLIC CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)

Of these, the Angstrom Sign is considered deprecated and won't show up in any normal forms. The second is the NFC (composed) form, and the third is the NFD (decomposed) form. The Cyrillic one looks the same, but is not the same abstract character, so isn't connected in any normalization form.

Normal forms also reorder the diacritics if there are multiple. The strings could be compared through their normalized encoded forms (like UTF-8), which I think is what you meant, or through their normalized code points directly. I agree it can be messy, but I'm curious what you meant by dubious. Do you think there's a better way?
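In Python, with the standard library's unicodedata, the situation above looks like this:

```python
import unicodedata

angstrom = "\u212b"        # ANGSTROM SIGN (deprecated)
a_ring   = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
a_comb   = "A\u030a"       # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
cyrillic = "\u0410\u030a"  # CYRILLIC CAPITAL LETTER A + COMBINING RING ABOVE

def nfc(s):
    return unicodedata.normalize("NFC", s)

# The first three all normalize to U+00C5...
print(nfc(angstrom) == nfc(a_ring) == nfc(a_comb))  # True
# ...but the Cyrillic lookalike stays a distinct abstract character.
print(nfc(cyrillic) == nfc(a_ring))                 # False
```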

0xCE0•7mo ago
The original intent of Unicode was great: a standard that maps a unique number (a code point) to a specific character of a language, where "character" means only an abstract, non-visual symbol with a meaning, not a visually rendered glyph in some stylistic font. Later Unicode versions added more languages, even dead ones, so it was also a historical-preservation effort.

Then came emoji, and now the Unicode Consortium's version updates seem to be mostly about adding more kinds of poop emoji and shades of skin color. Well, maybe that accurately reflects the language and culture of this modern time.

UTF-8 is great because it is a superset of ASCII, but because its byte width varies, it adds complexity to encoding and decoding (similar to fixed- vs. variable-width ISAs in CPUs).
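
A sketch of where that decoding complexity comes from: in UTF-8 the lead byte alone tells you the sequence length.

```python
# The high bits of a UTF-8 lead byte encode the sequence length.
def utf8_len(lead):
    if lead < 0x80:
        return 1              # 0xxxxxxx: ASCII
    if lead < 0xC0:
        raise ValueError("continuation byte, not a lead byte")
    if lead < 0xE0:
        return 2              # 110xxxxx
    if lead < 0xF0:
        return 3              # 1110xxxx
    return 4                  # 11110xxx

for ch in "Aé€𝄞":
    b = ch.encode("utf-8")
    assert utf8_len(b[0]) == len(b)
    print(ch, b.hex(), len(b), "bytes")
```

Fixed-width UTF-32 needs none of this branching, which is the ISA analogy.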

Different languages have different concepts, e.g. text direction/flow (left-to-right, right-to-left, top-to-bottom), characters vs. logograms, different kinds of visual cues, etc. Problems arise when humans want to combine different languages at the same time. Mathematical notation, for example, is in my opinion 2D graphics, and it cannot (usually) be inlined with text glyphs in an aesthetically pleasing way. The same kind of problem comes up when trying to inline languages with different flow directions. It's like trying to combine native GUI widgets from Win32, Cocoa/SwiftUI, and GTK/Qt/wxWidgets: the (visual) languages don't share the same concepts, or the concepts conflict.

Rendello•7mo ago
For what it's worth, the Unicode Consortium seems to have been trying to rein in the emoji explosion over the last few years. For example, they won't process any new proposals for flags [1][2]:

> The Unicode Consortium will no longer accept proposals for flags. Flags that correspond to officially assigned ISO 3166-1 alpha-2 region codes are automatically added, with no proposals necessary.

And they decided against adding multi-skin-toned families to the RGI set [3]; that is, vendors can encode them if they really want to, but it's not recommended. Apple, for example, afterwards replaced their more detailed family emoji with the recommended silhouettes [4].

1. https://blog.unicode.org/2022/03/the-past-and-future-of-flag...

2. https://unicode.org/emoji/proposals.html#Flags

3. https://www.unicode.org/L2/L2020/20114-family-emoji-explor.p...

4. https://blog.emojipedia.org/ios-17-4-emoji-changelog/

arp242•7mo ago
The emoji horse bolted long before Unicode. Everything from MSN Messenger to web forums had their own implementation of it. And that was a continuation of various ASCII emoticons, ranging from the simple :-) to the more complex ¯\_(ツ)_/¯

To say nothing of the fact that various emoji have been part of Unicode since pretty much the start, and were part of other encoding schemes as well (notably in Japan, but also e.g. the "Outlook J").

And if you actually look at the changes in each Unicode version, you'll see there are tons of language-related changes in every one. To say Unicode updates are just about emoji is just silly. The reason you don't notice is that they are mostly for small languages, obscure features of larger languages, historical languages, and things like that.

0xCE0•7mo ago
Great corrections from both of you.