
Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics

https://psychotechnology.substack.com/p/near-instantly-aborting-the-worst
1•eatitraw•58s ago•0 comments

Show HN: Nginx-defender – realtime abuse blocking for Nginx

https://github.com/Anipaleja/nginx-defender
2•anipaleja•1m ago•0 comments

The Super Sharp Blade

https://netzhansa.com/the-super-sharp-blade/
1•robin_reala•2m ago•0 comments

Smart Homes Are Terrible

https://www.theatlantic.com/ideas/2026/02/smart-homes-technology/685867/
1•tusslewake•4m ago•0 comments

What I haven't figured out

https://macwright.com/2026/01/29/what-i-havent-figured-out
1•stevekrouse•4m ago•0 comments

KPMG pressed its auditor to pass on AI cost savings

https://www.irishtimes.com/business/2026/02/06/kpmg-pressed-its-auditor-to-pass-on-ai-cost-savings/
1•cainxinth•5m ago•0 comments

Open-source Claude skill that optimizes Hinge profiles. Pretty well.

https://twitter.com/b1rdmania/status/2020155122181869666
2•birdmania•5m ago•1 comments

First Proof

https://arxiv.org/abs/2602.05192
2•samasblack•7m ago•1 comments

I squeezed a BERT sentiment analyzer into 1GB RAM on a $5 VPS

https://mohammedeabdelaziz.github.io/articles/trendscope-market-scanner
1•mohammede•8m ago•0 comments

Kagi Translate

https://translate.kagi.com
2•microflash•9m ago•0 comments

Building Interactive C/C++ workflows in Jupyter through Clang-REPL [video]

https://fosdem.org/2026/schedule/event/QX3RPH-building_interactive_cc_workflows_in_jupyter_throug...
1•stabbles•10m ago•0 comments

Tactical tornado is the new default

https://olano.dev/blog/tactical-tornado/
1•facundo_olano•12m ago•0 comments

Full-Circle Test-Driven Firmware Development with OpenClaw

https://blog.adafruit.com/2026/02/07/full-circle-test-driven-firmware-development-with-openclaw/
1•ptorrone•12m ago•0 comments

Automating Myself Out of My Job – Part 2

https://blog.dsa.club/automation-series/automating-myself-out-of-my-job-part-2/
1•funnyfoobar•12m ago•0 comments

Google staff call for firm to cut ties with ICE

https://www.bbc.com/news/articles/cvgjg98vmzjo
30•tartoran•13m ago•2 comments

Dependency Resolution Methods

https://nesbitt.io/2026/02/06/dependency-resolution-methods.html
1•zdw•13m ago•0 comments

Crypto firm apologises for sending Bitcoin users $40B by mistake

https://www.msn.com/en-ie/money/other/crypto-firm-apologises-for-sending-bitcoin-users-40-billion...
1•Someone•13m ago•0 comments

Show HN: iPlotCSV: CSV Data, Visualized Beautifully for Free

https://www.iplotcsv.com/demo
1•maxmoq•14m ago•0 comments

There's no such thing as "tech" (Ten years later)

https://www.anildash.com/2026/02/06/no-such-thing-as-tech/
1•headalgorithm•15m ago•0 comments

List of unproven and disproven cancer treatments

https://en.wikipedia.org/wiki/List_of_unproven_and_disproven_cancer_treatments
1•brightbeige•15m ago•0 comments

Me/CFS: The blind spot in proactive medicine (Open Letter)

https://github.com/debugmeplease/debug-ME
1•debugmeplease•15m ago•1 comments

Ask HN: What are the word games do you play everyday?

1•gogo61•18m ago•1 comments

Show HN: Paper Arena – A social trading feed where only AI agents can post

https://paperinvest.io/arena
1•andrenorman•20m ago•0 comments

TOSTracker – The AI Training Asymmetry

https://tostracker.app/analysis/ai-training
1•tldrthelaw•24m ago•0 comments

The Devil Inside GitHub

https://blog.melashri.net/micro/github-devil/
2•elashri•24m ago•0 comments

Show HN: Distill – Migrate LLM agents from expensive to cheap models

https://github.com/ricardomoratomateos/distill
1•ricardomorato•24m ago•0 comments

Show HN: Sigma Runtime – Maintaining 100% Fact Integrity over 120 LLM Cycles

https://github.com/sigmastratum/documentation/tree/main/sigma-runtime/SR-053
1•teugent•24m ago•0 comments

Make a local open-source AI chatbot with access to Fedora documentation

https://fedoramagazine.org/how-to-make-a-local-open-source-ai-chatbot-who-has-access-to-fedora-do...
1•jadedtuna•26m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model by Mitchellh

https://github.com/ghostty-org/ghostty/pull/10559
1•samtrack2019•26m ago•0 comments

Software Factories and the Agentic Moment

https://factory.strongdm.ai/
1•mellosouls•27m ago•1 comments

Unicode Footguns in Python

https://pythonkoans.substack.com/p/koan-15-the-invisible-ink
43•meander_water•3mo ago

Comments

naIak•3mo ago
I’m going to trigger some PTSD with this…

UnicodeDecodeError

morshu9001•3mo ago
Unicode footguns, in Python
OkayPhysicist•3mo ago
Frankly, the key takeaway from most problems people run into with Unicode is that there are very, very few operations that are universally well-defined for arbitrary user-provided text. The moment you step outside the realm of "receive, copy, save, regurgitate", you're probably going to run into edge cases.
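A minimal Python sketch of that point: even something as innocent as reversing a string is not well-defined for arbitrary text, because codepoint-level reversal separates combining marks from their base characters.

```python
s = "cafe\u0301"      # "café": 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))          # 5 codepoints, though a user sees 4 characters
backwards = s[::-1]    # codepoint-level reversal
print(backwards)       # the combining accent now precedes 'e' with no base letter
```

The reversed string starts with a bare combining accent, which most renderers will attach to the wrong character or display as a dotted-circle placeholder.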
dhosek•3mo ago
Grapheme count is not a useful number. Even in a monospaced font, the grapheme count doesn't give you a measurement of width, since emoji are usually not the same width as other characters.
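Python's stdlib exposes the Unicode East_Asian_Width property behind this point; a sketch (the 'W' result for emoji assumes a Python built against Unicode 9.0 or later data):

```python
import unicodedata

# East_Asian_Width classes: 'Na' = narrow, 'F' = fullwidth, 'W' = wide, ...
print(unicodedata.east_asian_width("A"))            # 'Na' -- one cell wide
print(unicodedata.east_asian_width("\uff21"))       # 'F'  -- fullwidth 'A', two cells
print(unicodedata.east_asian_width("\U0001F642"))   # 'W'  -- emoji are wide too
```

So even a "monospaced" terminal mixes one- and two-cell characters, and a grapheme count alone can't predict column width.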
paulddraper•3mo ago
Grapheme count (or rather, grapheme indexing) is necessary to do text selection or cursor positioning.

Fortunately you can usually outsource this to a UI toolkit.

WorldMaker•3mo ago
Most of the UI toolkits defer to font-based layout calculation engines rather than grapheme counts. Grapheme counts are a handy approximation in many cases, but you aren't going to get truly accurate text selection or cursor positions without real geometry from a font and whatever layout engine is closest to it.

(Fonts may disagree on supported ligatures, for instance, or not support an emoji grapheme cluster and fall back to displaying multiple component emoji, or layout multiple graphemes in two dimensions for a given language [CJK, Arabic, even Math symbols, even before factoring in if a font layout engine supports the optional but cool UnicodeMath [0]] or any number of other tweaks and distinctions between encoding a block of text and displaying a block of text.)

[0] https://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1....

dhosek•3mo ago
Ligatures are a whole other thing, independent of graphemes. Grapheme clustering is defined entirely independently of the font, and a sequence like U+1F1E6 U+1F1E6 is considered a single grapheme according to the Unicode specification even though no font will display it as a single character (it will instead be rendered as something resembling two A’s in boxes in most fonts).
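For illustration, here is the codepoint-level view of that sequence in Python (the stdlib has no grapheme segmentation, so the cluster count of 1 is stated as a comment rather than computed):

```python
import unicodedata

pair = "\U0001F1E6\U0001F1E6"      # REGIONAL INDICATOR SYMBOL LETTER A, twice
print(len(pair))                    # 2 codepoints
print(unicodedata.name(pair[0]))    # REGIONAL INDICATOR SYMBOL LETTER A
# Unicode's grapheme-cluster rules (UAX #29) treat any two consecutive
# regional indicators as ONE cluster; a grapheme-aware library (e.g. the
# third-party `regex` module's \X) would report a count of 1 here.
```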
WorldMaker•3mo ago
I included ligatures as a logically separate concept that needs to be accounted for when counting the "displayed width" of a string and how you handle cursor position and selection logic.

However, ligatures are a part of grapheme clustering, and it isn't entirely independent: examples above and elsewhere include ligatures that have Unicode encodings (e.g. fi). Ligatures have been a part of character encodings, and have affected grapheme clustering, since EBCDIC (from which Unicode directly inherits a lot of its single-codepoint Latin ligatures). There have been way too many debates about whether normal forms should encode ligatures, always decompose them, or follow some other strategy. The normal forms (NFC and NFD, NFKC and NFKD) themselves are embedded as a step of several of the standard grapheme clustering algorithms.

Some people think ligatures should be left entirely to fonts and that Unicode ligatures are a relic of the IBM bitmap-font past. Some font designers think it would be nice if some ligatures were more directly encoded so that normal forms could do a first pass for them. Unicode has had portions of the specification on both sides of that argument. It generally leans away from the backward-compatibility ligatures and normalizes them to decomposition, especially in locales like "en-us" these days, but not always and not in every locale. (All of that is before you consider languages that are almost nothing but long sequences of ligatures, including but not limited to Arabic.)

You can't do grapheme clustering and entirely ignore ligatures. You certainly can't count "display width" or "display position" without paying attention to ligatures, which was the point of bringing them up alongside mentions of grapheme clustering length and why it is insufficient for "display position" and "display width".

dhosek•3mo ago
Umm, you’re confusing things here. When you said “several of the standard grapheme clustering algorithms”, that indicated the first element of confusion: there is one grapheme clustering algorithm in Unicode.¹ Ligature combinations, whether encoded in Unicode or not, are independent of the concept of graphemes and are very much font-dependent, as is whether a grapheme cluster is displayed as multiple characters or a single character (any two consecutive regional indicators are considered a single grapheme whether they map to a national flag or not; the set of supported national flags, and whether those flags are even displayed or the output is just the regional indicator symbols, is up to the font). Normalization does not play a role in grapheme clustering at all, which is why á and a + combining acute accent are both considered single graphemes.

I’m puzzled about the assertion of EBCDIC having Latin ligatures because I never encountered them in my IBM mainframe days and a casual search didn’t turn any up. The only 8-bit encoding that I was aware of that included any ligatures was Mac extended ASCII which included fi and fl (well, putting aside TeX’s 7-bit encoding which was only used by TeX and because of its using x20 to hold the acute accent was unusable by other applications which expected a blank character in that spot).

The question of dealing with ligatures for non-Latin scripts generally came down to compatibility with existing systems more than anything else. This is why, for example, the Devanagari and Thai encodings, which both have vowel markings that can occur before, after or around a consonant, handle the sequence of input differently. Assembly of jamo into syllables in Hangul is another case where, theoretically, all the displayed characters could be handled through ligatures rather than encoding the syllables directly (which is, in fact, how most modern Korean is encoded, as evidenced by Korean Wikipedia), but because the original Korean font encoding has all the syllables encoded in it², those syllables are part of Unicode as well.

But the bottom line here is that you seem to be confusing terminology a lot. You can very much do grapheme clustering without paying attention to ligatures, and the rules for normalized composition/decomposition are entirely independent of grapheme clustering (the K-forms of both manage transitions like ¹ to 1 or ﬁ to fi, and represent a one-way transition).

⸻

1. I wrote a Rust library implementing this and I’ve been following Unicode since it was a proposal from Microsoft and Apple in the 90s, so I know a little about this.

2. I think the more accurate term would be “most” of the syllables as I’ve seen additional syllables added to Unicode over time.

WorldMaker•2mo ago
My point has been that grapheme clustering doesn't matter as a useful length on its own, especially not for display length. The technical specifics and corrections are getting into weeds that are orthogonal to the point of the discussion in the first place.

> I’m puzzled about the assertion of EBCDIC having Latin ligatures because I never encountered them in my IBM mainframe days and a casual search didn’t turn any up. The only 8-bit encoding that I was aware of that included any ligatures was Mac extended ASCII which included fi and fl (well, putting aside TeX’s 7-bit encoding which was only used by TeX and because of its using x20 to hold the acute accent was unusable by other applications which expected a blank character in that spot).

EBCDIC had multiple "code pages" to handle various international encodings. (DOS and Windows inherited the "code page" concept from IBM mainframes.) Of the code pages EBCDIC supported in the mainframe era, many were "Publishing" code pages intended for "pretty printing" text to printers that supported presumably and primarily bitmap fonts. A low ID example of such is IBM Code Page (CCSID) 361: https://web.archive.org/web/20130121103957/http://www-03.ibm...

You can see that code page includes fi, fl, ff, ffi, ij, among others even less common in English text.

Most but not all of the IBM Code Pages were a part of the earliest Unicode standards encoding efforts.

dhosek•2mo ago
There were numerous other standardized code pages beyond those. ij is a digraph traditionally used in Dutch; it was present in at least one of the ECMA code pages and isn’t really a publication ligature (it’s more on par with, e.g., the dz digraph for Croatian). Once upon a time I had a four-inch binder with printouts of every single ECMA-accepted code page, which in addition to all the national/regional 8-bit encodings also included a number of Asian 16-bit encodings, including three separate Chinese encodings (PRC, Taiwan, Hong Kong).
Spivak•3mo ago
For certain use-cases, but it's not like any of the other usual notions of text length are any better for what you want.
lmm•3mo ago
If all possible notions of length are footguns, maybe there should be no default "length" operation available.
WorldMaker•3mo ago
Indeed. Several languages have debated dropping, or have already dropped, an easy-to-access "Length" count on strings, making it much more explicit whether you want "UTF-8 encoded length" or "codepoint count" or "grapheme count" or "grapheme cluster count" or "laid-out font width".

Why endorse a bad winner when you can make the trade-offs more obvious and give programmers a better chance of asking for the right information, instead of using the wrong information because it is the default and assuming it is correct?
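A quick illustration of how those candidate "lengths" diverge for one short string (a sketch; a grapheme cluster count would need a third-party library, so it appears only as a comment):

```python
import unicodedata

s = "e\u0301"                    # 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(s))                     # 2 -- codepoint count
print(len(s.encode("utf-8")))    # 3 -- UTF-8 encoded length
print(len(unicodedata.normalize("NFC", s)))  # 1 -- composes to precomposed 'é'
# the grapheme cluster count is also 1, but computing it needs e.g. the
# third-party `regex` module -- the stdlib offers no such "length"
```

Four different answers for what a user sees as one character, which is exactly why a single default `len` is a trap.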

renhanxue•3mo ago
The article has good tips, but Unicode normalization is just the tip of the iceberg. It is almost always impossible to do what your users expect without locale information (different languages and locales sort and compare the same graphemes differently). "What do we mean when we say two strings are equal?" can be a surprisingly difficult question to answer, and it's a practical question, not a philosophical one.

By the way, try looking up the standardized Unicode casefolding algorithm sometime, it is a thing to behold.
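Python exposes full Unicode case folding as str.casefold(); a small sketch of why plain lowercasing is not enough, and why even casefolding can't replace locale knowledge:

```python
# German sharp s: lower() leaves it alone, casefold() expands it to 'ss'
print("Straße".lower())       # straße
print("Straße".casefold())    # strasse -- now comparable to "STRASSE"

# Turkish dotted capital I folds to 'i' plus a combining dot above
print(len("İ".casefold()))    # 2 codepoints
# note: casefold() is locale-independent -- a Turkish-locale user would
# expect I <-> ı and İ <-> i, which no context-free fold can deliver
```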

Groxx•3mo ago
the normalization doc is interesting too imo: https://unicode.org/reports/tr15/

in particular, the differences between NFC and NFKC are "fun", and rather meaningful in many cases. e.g. NFC says that "fi" and "ﬁ" are different and not equal, though the latter is just a single-codepoint ligature of the former and is literally identical in meaning. this applies to "ﬃ" too. half- vs full-width Chinese characters are also "different" under NFC. NFKC makes those examples equal though... at the cost of saying "2⁵" is equal to "25".

language is fun!
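Those examples, sketched with Python's stdlib unicodedata module:

```python
import unicodedata

lig = "\ufb01"   # U+FB01 LATIN SMALL LIGATURE FI

# NFC keeps compatibility characters distinct...
print(unicodedata.normalize("NFC", lig) == "fi")    # False
# ...while NFKC folds them together
print(unicodedata.normalize("NFKC", lig) == "fi")   # True

# the cost: NFKC also erases distinctions you may care about
print(unicodedata.normalize("NFKC", "2\u2075"))     # 25  (was 2⁵)

# half- vs full-width forms behave the same way
print(unicodedata.normalize("NFKC", "\uff21"))      # A   (was fullwidth Ａ)
```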

o11c•3mo ago
I've said this before and I'll say it again: Python 3 got rid of the wrong string type.

With `bytes` it was obvious that byte length was not the same as $whatever length, and that was really the only semi-common bug (and it was mostly limited to English speakers new to programming). All other bugs come from blindly trusting `unicode`, whose bugs are far more subtle and numerous.

Flimm•3mo ago
I strongly disagree. Python 2 had no bytes type to get rid of. It had a string type that could not handle code points above U+00FF at all, and could not handle code points above U+007F very well. In addition, Python 2 had a Unicode type, and the types would get automatically converted to each other and/or encoded/decoded, often incorrectly, and sometimes throwing runtime exceptions.

Python 3 introduced the bytes type that you like so much. It sounds like you would enjoy a Python 4 with only a bytes type and no string type, and presumably with a strong convention to only use UTF-8 or with required encoding arguments everywhere.

In both Python 2 and Python 3, you still have to learn how to handle grapheme clusters carefully.
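A sketch of the Python 3 discipline described above: bytes and str never mix implicitly, and every crossing between them names an encoding:

```python
raw = b"caf\xc3\xa9"                 # bytes off the wire (UTF-8 for "café")
text = raw.decode("utf-8")            # explicit decode at the boundary
print(text)                           # café
print(text.encode("utf-8") == raw)    # True -- round-trips cleanly

# mixing the two types is a TypeError in Python 3, not a silent coercion
try:
    _ = "café" + raw
except TypeError as exc:
    print("refused:", exc)
```

In Python 2 the equivalent concatenation would silently attempt an ASCII decode, which is exactly the class of intermittent runtime error Python 3 eliminated.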

seanhunter•3mo ago
Python 3 didn't get rid of bytes though. If you want to manipulate data as bytes you absolutely can do that.

https://docs.python.org/3/library/stdtypes.html#binary-seque...

"The core built-in types for manipulating binary data are bytes and bytearray."

o11c•3mo ago
Those are arrays of integers, not of bytes. Most bytes are character-ish, which only Python 2's bytes acknowledged.

Additionally, Python 2 supported a much richer set of operations on its bytes type than Python 3 does.
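The distinction being made here, sketched in Python 3: indexing bytes yields integers, where Python 2's str yielded one-character byte strings, and some str-like operations only returned to bytes later (PEP 461 restored %-formatting in 3.5):

```python
data = b"abc"
print(data[0])       # 97 -- an int, not b'a'
print(data[:1])      # b'a' -- slicing is how you get a length-1 bytes
print(list(data))    # [97, 98, 99]
# Python 2: b"abc"[0] was "a"; Python 3 bytes regained %-formatting in 3.5
print(b"%s!" % b"hi")  # b'hi!'
```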