Slightly better named character reference tokenization than Chrome, Safari, FF

https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/
36•todsacerdoti•23h ago

Comments

deepdarkforest•4h ago
This might not get a lot of traction because it's very technical, but I wanted to say a massive well done for the effort. 20k words on anything this specific is no joke. I wish I could put this level of commitment into anything in life; this was inspiring if nothing else.
squeek502•4h ago
Appreciate it (I'm the author). I'd like to think there's a good bit of interesting stuff in here outside of the specific topic of named character reference tokenization.
chaps•3h ago
"no[t] a 'data structures' person"

says the person who wrote an extremely technical 20k word blog post on data structures! <3

arthurcolle•3h ago
Congratulations on your newfound promotion to data structures person btw
Ndymium•2h ago
Thanks to your article I just realised my HTML entity codec library doesn't support decoding those named entities that can omit the semicolon at the end. More work for me, good thing my summer vacation just started! :)
masfuerte•1h ago
That was a good read. I reread the relevant section of the HTML5 spec and noticed an error in an example:

> For example, &not;in will be parsed as "¬in" whereas &notin will be parsed as "∉".

Only a small minority of the named character references are permitted without a closing semicolon, and notin is not one of them. So &notin is actually parsed as "¬in". &notin; is parsed as "∉".
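The behavior described above can be checked with Python's standard library, whose `html.unescape` is documented as following the HTML5 rules for named character references. This is a minimal sketch, not a statement about any browser's tokenizer, and it ignores the separate rule that applies inside attribute values:

```python
import html

# Compare the three spellings discussed above.
# Only "notin;" (with the semicolon) is a name in the table; "not" is one of the
# legacy names that may omit the semicolon, so "&notin" decodes as "&not" + "in".
examples = ["&not;in", "&notin", "&notin;"]
decoded = {text: html.unescape(text) for text in examples}

for text, result in decoded.items():
    print(f"{text!r} -> {result!r}")
```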

https://html.spec.whatwg.org/#parse-error-missing-semicolon-...

squeek502•1h ago
Good catch, that does indeed look like a mistake in the spec. Everything past the first sentence of that error description is suspect, honestly (seems like it was naively adapted from the example in [1] but that example isn't relevant to the missing-semicolon-after-character-reference error).

Will submit an issue/PR to correct it when I get a chance.

[1] https://html.spec.whatwg.org/multipage/parsing.html#named-ch...

o11c•18m ago
Congratulations, you've reinvented regexes. This is still a win since you're using the sane kind of regex and are allowing multiple accept states rather than just one, in both cases unlike most modern implementations.

(I'm mostly throwing out my thoughts as they appear; some parts of this end up duplicating what's in the article, hopefully with more standard terminology though)

Note that at runtime there is no difference between a standard DFA and what you call a DAFSA. The difference is entirely at construction time.

In lexers, your `end_of_word` is usually called `accept`, and rather than being a `bool` it is an integer: 0 for no-accept, N for the Nth valid accept value, which in your case should probably be an index within the array of all possible characters. Note that since multiple entity names map to the same character, you will have multiple nodes with the same `accept`. I think your perfect-hash approach requires duplicating them, which admittedly might be a win since you are far from the typical lexing case where there are many possible inputs for some outputs. However, this does mean you can't play games with the bits of `accept` to encode the length of your lookup as well as the start. If we're saving size, I lean toward UTF-8, either nul-terminated or with an explicit length.
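As a rough illustration of an integer `accept` in place of a boolean flag, here is a minimal sketch. It uses a plain, unminimized trie rather than a DAFSA, and the five-name table, `OUTPUTS`, `build_trie`, and `longest_match` are all hypothetical names invented for the example, not anything from the article:

```python
# Hypothetical mini-table: index 0 of OUTPUTS is reserved for "no accept".
OUTPUTS = ["", "\u00ac", "\u2209", "&"]
# Several names share one output, so they share one accept value.
NAMES = {"not": 1, "not;": 1, "notin;": 2, "amp": 3, "amp;": 3}

def build_trie(names):
    # Node i has a child map children[i] and an integer accept[i]; node 0 is the root.
    children = [{}]
    accept = [0]
    for name, acc in names.items():
        node = 0
        for ch in name:
            nxt = children[node].get(ch)
            if nxt is None:
                nxt = len(children)
                children.append({})
                accept.append(0)
                children[node][ch] = nxt
            node = nxt
        accept[node] = acc
    return children, accept

CHILDREN, ACCEPT = build_trie(NAMES)

def longest_match(text):
    """Return (accept value, matched length) for the longest name that prefixes text."""
    node, best, best_len = 0, 0, 0
    for i, ch in enumerate(text):
        node = CHILDREN[node].get(ch)
        if node is None:
            break
        if ACCEPT[node]:
            # Remember the most recent accepting node; keep walking for a longer match.
            best, best_len = ACCEPT[node], i + 1
    return best, best_len
```

The accept value doubles as an index into `OUTPUTS`, so `OUTPUTS[longest_match("notin;")[0]]` gives the decoded character directly.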

The next thing you should do is use equivalence classes rather than dealing with every character individually. For this particular parsing problem, almost all of your equivalence classes will only have a single character, but you still win big by mapping all invalid characters to a single class. Since there are only 51 characters used in entity names, this means you only need 6 bits per character (which should be fast since you only need to special-case non-letters). And since many of those only appear for the first letter, you can probably deal with 5 or fewer with minimal logic ahead of time.
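A minimal sketch of the equivalence-class idea, assuming Python's `html.entities.html5` table as a stand-in for the entity list. `ALPHABET`, `CLASS_TABLE`, and `char_class` are hypothetical names, and the alphabet size is computed from that table rather than taken from the comment:

```python
from html.entities import html5  # Python's copy of the HTML5 named reference table

# Every character that appears in any entity name (the trailing ';' is excluded here).
ALPHABET = sorted({ch for name in html5 for ch in name.rstrip(";")})

# Class 0 is the single shared "invalid" class; each valid character gets its own class.
CLASS_TABLE = [0] * 256
for cls, ch in enumerate(ALPHABET, start=1):
    CLASS_TABLE[ord(ch)] = cls

def char_class(ch):
    """Map a character to its equivalence class; anything outside the alphabet is 0."""
    code = ord(ch)
    return CLASS_TABLE[code] if code < 256 else 0

# Bits needed to store one class index, including the invalid class.
NUM_CLASSES = len(ALPHABET) + 1
BITS_PER_CLASS = (NUM_CLASSES - 1).bit_length()
```

A transition table indexed by `char_class(ch)` then needs only `NUM_CLASSES` columns per state instead of one column per possible byte.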

That said - one important lesson from lexing is that it is almost always a mistake to lex keywords; whenever possible, just lex an identifier and then do a map lookup. The reason that can't be done is entirely because of those entities which do not require the semicolon, so I suspect that the optimal approach is going to be: after resolving `document.write`, look ahead for a semicolon, and if found use the fast path; only if that fails, enter the (much smaller) DFA for the few that do not require a semicolon. But since you don't have identifiers you might not be hitting the worst case (explosive splitting) anyway.
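The two-path idea might look roughly like the sketch below. It again assumes Python's `html.entities.html5` table, uses a simple longest-prefix scan over a set where the comment suggests a small DFA, and ignores the attribute-value special case; `NO_SEMICOLON` and `decode_reference` are hypothetical names:

```python
from html.entities import html5

# Names the table lists without a trailing ';' (the legacy semicolon-optional set).
NO_SEMICOLON = {name for name in html5 if not name.endswith(";")}
MAX_LEGACY_LEN = max(len(name) for name in NO_SEMICOLON)
MAX_NAME_LEN = max(len(name) for name in html5)

def decode_reference(text):
    """Decode a reference at the start of text, where text begins right after '&'.

    Returns (replacement, consumed length), or (None, 0) when nothing matches.
    """
    # Fast path: look ahead for a nearby ';' and do one map lookup on the whole name.
    end = text.find(";", 0, MAX_NAME_LEN)
    if end != -1:
        hit = html5.get(text[: end + 1])
        if hit is not None:
            return hit, end + 1
    # Slow path: longest prefix among the few names that may omit the ';'.
    for length in range(min(len(text), MAX_LEGACY_LEN), 0, -1):
        name = text[:length]
        if name in NO_SEMICOLON:
            return html5[name], length
    return None, 0
```

Because a `;` can only appear at the end of a name, a successful fast-path lookup is already the longest possible match, so the slow path only runs when the semicolon is missing or the name before it is unknown.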

For something this small, binary search is probably a mistake (being very unpredictable for the CPU) if you're doing everything else right; you're better off doing a linear search if you can't just use SIMD magic to match them in parallel. Struct-of-arrays is probably pointless for a problem set that fits in L1, but might start winning again if you want to leave some L1 for other parts of the program. Storing siblings/cousins next to each other (as an accident of construction) means you're probably already as Eytzinger-like as you can be.

(Edit: fix incomplete and missing thoughts)
