How I wrote JustHTML, a Python-based HTML5 parser, using coding agents

https://friendlybit.com/python/writing-justhtml-with-coding-agents/
40•simonw•4d ago

Comments

simonw•4d ago
JustHTML https://github.com/EmilStenstrom/justhtml is a neat new Python library - it implements a compliant HTML5 parser in ~3,000 lines of code that passes the full existing 9,200-test HTML5 conformance suite.

Emil Stenström wrote it with a variety of coding agent tools over the course of a couple of months. It's a really interesting case study in using coding agents to take on a very challenging project, taking advantage of their ability to iterate against existing tests.

I wrote a bit more about it here: https://simonwillison.net/2025/Dec/14/justhtml/

EmilStenstrom•4d ago
Thanks for sharing, Simon! Writing a parser is a really good job for a coding agent, because there's a clear right/wrong answer. In this case, the path there is the challenging part. The hours I've spent trying to convince agents to implement the adoption agency algorithm well... :)
msephton•3d ago
RSS on website is erroring. I'd like to follow!
EmilStenstrom•3d ago
Thanks! Now fixed.
gabrielsroka•4d ago
> 3,000 loc

I cloned the repo and ran `wc -l` on the src directory and got closer to 9,500. Am I missing something?

Edit: maybe you meant just the parser

HPsquared•1h ago
Better to use something like `cloc` which excludes blank and comment lines.
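
To illustrate the difference between the two counts, here is a rough sketch of a "code lines only" counter. It is not how `cloc` works internally; it only skips blank lines and whole-line `#` comments, and it still counts docstrings as code. The function name `count_code_lines` and the sample text are made up for the example.

```python
def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor whole-line '#' comments.

    Simplification (assumption for this sketch): docstrings and
    trailing comments are still counted as code.
    """
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Hypothetical sample: 5 physical lines, 2 of which are code.
sample = '''# a comment

def add(a, b):
    # explain
    return a + b
'''

total_lines = len(sample.splitlines())  # roughly what `wc -l` reports
code_lines = count_code_lines(sample)   # blank and comment lines skipped
print(total_lines, code_lines)
```

On real source trees the gap between the two numbers depends heavily on how much of the code is comments, docstrings, and blank lines.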
minusf•4d ago
while it's mentioned in the post, it seems to me a bit buried:

isn't this more like a port of `html5ever` from rust to python using LLM, as opposed to creating something "new" based on the test suite alone?

if yes, wouldn't the distinction be rather important?

EmilStenstrom•4d ago
Depending on your perspective, you can take away either of the two points.

The first iteration of the project created a library from scratch, from the tests all the way to 100% test coverage. So even without the second iteration, it's still possible to create something new.

In an attempt to speed it up, I (with a coding agent) rewrote it again based on html5ever's code structure. It's far from a clean port, because html5ever is heavily optimized Rust code, parts of which can't be ported to Python directly (Rust macros). And it still depended on a lot of iteration and rerunning tests to get it anywhere.

I'm not pushing any agenda here, you're free to take what you want from it!

minusf•4d ago
Thank you for the clarification, that was not entirely clear to me from the post.

You also mention that the current "optimised" version is "good enough" for every-day use (I use `bs4` for working with html), was the first iteration also usable in that way? Did you look at `html5ever` because the LLM hit a wall trying to speed it up?

EmilStenstrom•4d ago
It was usable! Yeah, the handler based architecture that I had built on was very dependent on object lookups and method calls, and my hunch was that I had hit a wall trying to optimize the speed. I was slower than html5lib still, so decided to go with another "code architecture" (html5ever) that was closer to the metal. Worked out in getting me ~60% faster than html5lib.

As for bs4, if you don't change the default, you get the stdlib html.parser, which doesn't implement HTML5 parsing. It only works reliably for valid HTML.
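
A small sketch of what that means in practice, using only the stdlib `html.parser` module (not bs4 and not JustHTML). The `EventRecorder` class and `record_events` helper are names made up for this example. The stdlib parser reports the tags it sees as events, but it does not apply the HTML5 tree-construction rules, so for markup like `<p>one<p>two` it should report two start tags and no implied end tag for the first paragraph.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record start/end tag events exactly as html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record_events(markup: str) -> list:
    """Feed markup through the stdlib parser and return the tag events."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()  # flush anything still buffered
    return recorder.events


# An HTML5 parser would implicitly close the first <p> when it sees the second.
events = record_events("<p>one<p>two")
print(events)
```

An HTML5-compliant parser would build two sibling `<p>` elements from the same input. With the stdlib parser, any such repair is left to the code consuming the events.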

simonw•4d ago
I just had Codex CLI figure out where that first version ended and the new one began.

It looks to me like this is the last commit before the rewrite: https://github.com/EmilStenstrom/justhtml/tree/989b70818874d...

The commit after that is https://github.com/EmilStenstrom/justhtml/commit/7bab3d2 "radical: replace legacy TurboHTML tree/handler stack with new tokenizer + treebuilder scaffold"

It also adds this document called html5ever_port_plan.md: https://github.com/EmilStenstrom/justhtml/blob/7bab3d22c0da0...

Here's the Codex CLI transcript I used to figure this out: https://gistpreview.github.io/?53202706d137c82dce87d729263df...

vivzkestrel•3d ago
if it isn't too much to ask, since you are already insanely familiar with HTML parser semantics, can you write a Postgres extension that can parse HTML inside Postgres? Use case: cleaning RSS feed items while storing them
EmilStenstrom•3d ago
The license is MIT, so feel free to expand this any way you want! No need to write a new parser from scratch.
furyofantares•3d ago
Is it really too much to do a little more editing of the LLM output for the blog post? There are 17 numbered and titled section headings, all of which are linkable with anchors, and most of which have two sentences each.
EmilStenstrom•3d ago
Hi! Yes, the headers were LLM generated and the text was not. I didn't want the blog post to go on for ages, so I just wrote a few lines under each heading. Any ideas how to make it better, while not being too long?
furyofantares•3d ago
I'd start by deleting all the numbered section headings, and add either a transition word (then, so) or a transition sentence (why you went from step n to step n+1 or after how much time or whatnot).
EmilStenstrom•3d ago
New iteration up. I kept the headings because they make the text easier to scan, but made them more descriptive. Added some transition words. Slight improvement I think.
Aloisius•9m ago
> Copyright (c) 2025 Emil Stenström

If you're not the author, you don't own the copyright.
