Slightly better named character reference tokenization than Chrome, Safari, FF

https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/
64•todsacerdoti•7mo ago

Comments

deepdarkforest•7mo ago
This might not get a lot of traction because it's very technical, but i wanted to say a massive well done for the effort. 20k words on anything this specific is not a joke. I wish i would put this level of commitment to anything in life, this was inspiring if nothing else.
squeek502•7mo ago
Appreciate it (I'm the author). I'd like to think there's a good bit of interesting stuff in here outside of the specific topic of named character reference tokenization.
chaps•7mo ago
"no[t] a 'data structures' person"

says the person who wrote an extremely technical 20k word blog post on data structures! <3

arthurcolle•7mo ago
Congratulations on your newfound promotion to data structures person btw
Ndymium•7mo ago
Thanks to your article I just realised my HTML entity codec library doesn't support decoding those named entities that can omit the semicolon at the end. More work for me, good thing my summer vacation just started! :)
masfuerte•7mo ago
That was a good read. I reread the relevant section of the HTML5 spec and noticed an error in an example:

> For example, &not;in will be parsed as "¬in" whereas &notin will be parsed as "∉".

Only a small minority of the named character references are permitted without a closing semicolon, and notin is not one of them. So &notin is actually parsed as "¬in". &notin; is parsed as "∉".

https://html.spec.whatwg.org/#parse-error-missing-semicolon-...

squeek502•7mo ago
Good catch, that does indeed look like a mistake in the spec. Everything past the first sentence of that error description is suspect, honestly (seems like it was naively adapted from the example in [1] but that example isn't relevant to the missing-semicolon-after-character-reference error).

Will submit an issue/PR to correct it when I get a chance.

[1] https://html.spec.whatwg.org/multipage/parsing.html#named-ch...

o11c•7mo ago
Congratulations, you've reinvented regexes. This is still a win since you're using the sane kind of regex and are allowing multiple accept states rather than just one, in both cases unlike most modern implementations.

(I'm mostly throwing my thoughts as they appear, some parts of this end up duplicating what's in the article, hopefully with more standard terminology though)

Note that at runtime there is no difference between a standard DFA and what you call a DAFSA. The difference is entirely at construction time.

In lexers, your `end_of_word` is usually called `accept`, and rather than being a `bool` it is an integer (0 for no-accept, N for the Nth valid accept value, which in your case should probably be an index within the array of all possible characters. Note that since multiple entity names map to the same character, you will have multiple nodes with the same `accept`). I think your perfect-hash approach requires duplicating them (which admittedly might be a win since you are far from the typical lexing case where there are many possible inputs for some outputs. However, this does mean you can't play games with the bits of `accept` to encode the length of your lookup as well as the start - if we're saving size, I lean toward UTF-8, either nul-terminated or with an explicit length).

The next thing you should do is use equivalence classes rather than dealing with every character individually. For this particular parsing problem, almost all of your equivalence classes will only have a single character, but you still win big by mapping all invalid characters to a single class. Since there are only 51 characters used in entity names, this means you only need 6 bits per character (which should be fast since you only need to special-case non-letters). And since many of those only appear for the first letter, you can probably deal with 5 or fewer with minimal logic ahead of time.
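
A minimal sketch of the equivalence-class idea, in Python for brevity. The `NAMES` list is a tiny hypothetical sample, not the real entity list, and `build_classes` is an illustrative helper name. Every byte that never appears in a name collapses into class 0, so a transition table needs one column per class instead of 256.

```python
# Hypothetical sample of entity names; the real list is far larger.
NAMES = ["amp;", "amp", "lt;", "lt", "not;", "not", "notin;"]


def build_classes(names):
    """Map each byte value to a small class id; class 0 means 'never valid'."""
    alphabet = sorted({ch for name in names for ch in name})
    table = [0] * 256
    for class_id, ch in enumerate(alphabet, start=1):
        table[ord(ch)] = class_id
    # +1 accounts for the shared "invalid" class 0.
    return table, len(alphabet) + 1


CLASS_OF, NUM_CLASSES = build_classes(NAMES)

# Bits needed to store one class id.
BITS_PER_CLASS = (NUM_CLASSES - 1).bit_length()
```

With this sample, nine distinct characters appear in the names, so ten classes and four bits per character are enough.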

That said - one important lesson from lexing is that it is almost always a mistake to lex keywords; whenever possible, just lex an identifier and then do a map lookup. The reason that can't be done is entirely because of those entities which do not require the semicolon, so I suspect that the optimal approach is going to be: after resolving `document.write`, look ahead for a semicolon, and if found use the fast path; only if that fails, enter the (much smaller) DFA for the few that do not require a semicolon. But since you don't have identifiers you might not be hitting the worst case (explosive splitting) anyway.
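
One possible shape for that fast path / slow path split, sketched in Python. The `ENTITIES` and `LEGACY` tables are small hypothetical samples, `decode_reference` is an illustrative name, and the slow path here is a plain longest-prefix scan standing in for the smaller DFA. It also ignores the attribute-value special cases in the spec.

```python
# Tiny hypothetical tables; real ones would be generated from the spec's list.
ENTITIES = {"amp": "&", "lt": "<", "not": "\u00ac", "notin": "\u2209"}
LEGACY = ("amp", "lt", "not")  # assumed subset allowed without ';'


def decode_reference(text):
    """Decode one reference from the text that follows '&'.

    Returns (replacement, chars_consumed), or (None, 0) if nothing matches.
    """
    # Lex an "identifier": a run of ASCII letters and digits.
    end = 0
    while end < len(text) and text[end].isascii() and text[end].isalnum():
        end += 1
    name = text[:end]

    # Fast path: identifier followed by ';', resolved by one map lookup.
    if end < len(text) and text[end] == ";" and name in ENTITIES:
        return ENTITIES[name], end + 1

    # Slow path: longest legacy name that is a prefix of the input.
    for candidate in sorted(LEGACY, key=len, reverse=True):
        if text.startswith(candidate):
            return ENTITIES[candidate], len(candidate)
    return None, 0
```

Under these assumptions, `decode_reference("notin;")` takes the fast path and consumes all six characters, while `decode_reference("notin")` falls back to the legacy name `not` and consumes only three.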

For something this small, binary search is probably a mistake (being very unpredictable for the CPU) if you're doing everything else right; you're better off doing a linear search if you can't just use SIMD magic to match them in parallel. Struct-of-arrays is probably pointless for a problem set that fits in L1, but might start winning again if you want to leave some L1 for other parts of the program. Storing siblings/cousins next to each other (as an accident of construction) means you're probably already as Eytzinger-like as you can be.

(Edit: fix incomplete and missing thoughts)

o11c•7mo ago
Actually, there's one more trick I just remembered - you don't have to store an integer for `accept` at all, since you can arrange for the final state numbers to all be adjacent (usually, the first N positive integers; you probably want to save 0 as your fail state and use N+1 as your start state).

If you have splitting you'll have to duplicate accept states, so you can't just count your regexes. For example:

  three_as = /aaa/
  three_bs = /bbb/
  a_or_b = /[ab]/

  accept_values = [error, a_or_b, a_or_b, three_as, three_bs]
  state 0: . -> 0; error state
  state 1: a -> 6, b -> 0; accept state after "a"
  state 2: a -> 0, b -> 7; accept state after "b"
  state 3: . -> 0; accept state after "aaa"
  state 4: . -> 0; accept state after "bbb"
  state 5: a -> 1, b -> 2; start state
  state 6: a -> 3, b -> 0; intermediate state after "aa"
  state 7: a -> 0, b -> 4; intermediate state after "bb"
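
The state table above can be transcribed directly; this is a sketch in Python, with `match` as an illustrative name and whole-input matching as an assumption. Note that there is no per-state accept field: a state accepts exactly when its number falls in 1..N.

```python
N_ACCEPT = 4   # states 1..4 accept; 0 is the error state
START = 5      # N + 1, as suggested above
ACCEPT_VALUES = ["error", "a_or_b", "a_or_b", "three_as", "three_bs"]

# One dict per state; any character that is missing leads to state 0.
TRANSITIONS = [
    {},                 # 0: error state
    {"a": 6},           # 1: accept after "a"
    {"b": 7},           # 2: accept after "b"
    {},                 # 3: accept after three a's
    {},                 # 4: accept after three b's
    {"a": 1, "b": 2},   # 5: start state
    {"a": 3},           # 6: intermediate after "aa"
    {"b": 4},           # 7: intermediate after "bb"
]


def match(text):
    """Run the whole input through the DFA; return the accept value or None."""
    state = START
    for ch in text:
        state = TRANSITIONS[state].get(ch, 0)
    # Accepting is just a range check on the state number.
    return ACCEPT_VALUES[state] if 1 <= state <= N_ACCEPT else None
```
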

Due to the splitting you probably can't construct your state machine with the correct numbers in the first place. But it's always trivial to renumber states after the fact using an array:

  Start with an array mapping each number to itself.
    [0, 1, 2, 3, 4, 5, 6, 7]
  For each state that needs a specific number, swap the numbers at those indices:
    For example, if we need 5 and 7 to be accept states, we would have:
    [0, 5, 7, 3, 4, 1, 6, 2]
  Optionally, sort your non-accept states so your table still looks pretty:
    [0, 5, 7, 1, 2, 3, 4, 6]
  Walk the states updating their contents according to the array.
  Finally, apply the permutation to the states themselves according to the array, mutating the array as you go.
    Clearly, by performing the same operation on both arrays we get the intended effect once the index array is sorted again.
    (That said, if you find this confusing, you can just do it while copying instead of in-place)
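
The renumbering steps above, sketched in Python using the copying variant rather than the in-place one. This reads the array as "index = new state number, value = old state number", which is an assumption about the intended meaning. `renumber`, `wanted`, and the `EXAMPLE` machine are hypothetical names and data for illustration.

```python
def renumber(transitions, wanted, tidy=True):
    """Move the old states listed in `wanted` to new numbers 1, 2, ...

    Returns (order, new_transitions) where order[new] == old.
    """
    order = list(range(len(transitions)))
    for new, old in enumerate(wanted, start=1):
        cur = order.index(old)
        order[new], order[cur] = order[cur], order[new]
    if tidy:
        # Optional: keep the remaining states in their original relative order.
        k = len(wanted) + 1
        order[k:] = sorted(order[k:])

    # Invert the array so transition targets can be rewritten.
    new_of = {old: new for new, old in enumerate(order)}
    renumbered = [
        {ch: new_of[target] for ch, target in transitions[old].items()}
        for old in order
    ]
    return order, renumbered


# Hypothetical 8-state machine: a simple chain 1 -> 2 -> ... -> 7 on "x".
EXAMPLE = [{}, {"x": 2}, {"x": 3}, {"x": 4}, {"x": 5}, {"x": 6}, {"x": 7}, {}]

ORDER, TABLE = renumber(EXAMPLE, [5, 7])
```

With `tidy=False` the order array after the swaps is `[0, 5, 7, 3, 4, 1, 6, 2]`, and with the optional sort it is `[0, 5, 7, 1, 2, 3, 4, 6]`, matching the arrays listed above.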