How I wrote JustHTML, a Python-based HTML5 parser, using coding agents

https://friendlybit.com/python/writing-justhtml-with-coding-agents/
58•simonw•1mo ago

Comments

simonw•1mo ago
JustHTML https://github.com/EmilStenstrom/justhtml is a neat new Python library - it implements a compliant HTML5 parser in ~3,000 lines of code that passes the full existing 9,200 test HTML5 conformance suite.

Emil Stenström wrote it with a variety of coding agent tools over the course of a couple of months. It's a really interesting case study in using coding agents to take on a very challenging project, taking advantage of their ability to iterate against existing tests.

I wrote a bit more about it here: https://simonwillison.net/2025/Dec/14/justhtml/

EmilStenstrom•1mo ago
Thanks for sharing, Simon! Writing a parser is a really good job for a coding agent, because there's a clear right/wrong answer. In this case, the path there is the challenging part. The hours I've spent trying to convince agents to implement the adoption agency algorithm well... :)
msephton•1mo ago
RSS on website is erroring. I'd like to follow!
EmilStenstrom•1mo ago
Thanks! Now fixed.
gabrielsroka•1mo ago
> 3,000 loc

I cloned the repo and ran `wc -l` on the src directory and got closer to 9,500. Am I missing something?

Edit: maybe you meant just the parser

HPsquared•1mo ago
Better to use something like `cloc` which excludes blank and comment lines.
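To illustrate why the two counts can differ so much, here is a rough sketch of the idea behind a code-line counter: skip blank lines and lines that are only a `#` comment. This is not how `cloc` is implemented; the function names are made up for the example, and it does not attempt to handle docstrings or multi-line strings.

```python
from pathlib import Path


def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure '#' comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        # A line with code followed by a trailing comment still counts as code.
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def count_tree(root: str) -> int:
    """Sum the code-line count over every .py file below `root`."""
    return sum(
        count_code_lines(path.read_text(encoding="utf-8"))
        for path in Path(root).rglob("*.py")
    )
```

Calling `count_tree("src")` on a checkout would give a number comparable in spirit to a `cloc` code count, while `wc -l` also includes the blank and comment lines.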
minusf•1mo ago
while it's mentioned in the post, it seems to me a bit buried:

isn't this more like a port of `html5ever` from rust to python using LLM, as opposed to creating something "new" based on the test suite alone?

if yes, wouldn't the distinction be rather important?

EmilStenstrom•1mo ago
Depending on your perspective, you can take away either of the two points.

The first iteration of the project created a library from scratch, from the tests all the way to 100% test coverage. So even without the second iteration, it's still possible to create something new.

In an attempt to speed it up, I (with a coding agent) rewrote it again based on html5ever's code structure. It's far from a clean port, because html5ever is heavily optimized Rust code that can't be ported directly to Python (Rust macros, for example). And it still depended on a lot of iteration and rerunning tests to get it anywhere.

I'm not pushing any agenda here, you're free to take what you want from it!

minusf•1mo ago
Thank you for the clarification, that was not entirely clear to me from the post.

You also mention that the current "optimised" version is "good enough" for everyday use (I use `bs4` for working with html). Was the first iteration also usable in that way? Did you look at `html5ever` because the LLM hit a wall trying to speed it up?

EmilStenstrom•1mo ago
It was usable! Yeah, the handler-based architecture that I had built on was very dependent on object lookups and method calls, and my hunch was that I had hit a wall trying to optimize the speed. I was still slower than html5lib, so I decided to go with another "code architecture" (html5ever) that was closer to the metal. That worked out, getting me ~60% faster than html5lib.

As for bs4, if you don't change the default, you get the stdlib html.parser, which doesn't implement HTML5 parsing. It only works for valid HTML.
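A small sketch of what that means in practice, using only the standard library. The stdlib `html.parser.HTMLParser` is event-based and reports tags in the order they appear in the source; it does not apply the HTML5 tree-construction recovery rules (such as the adoption agency algorithm for misnested formatting tags). The `EventLogger` class and `log_events` helper are made up for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the raw start/end/data events emitted by the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html: str) -> list:
    parser = EventLogger()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.events


# Misnested <b>/<i>: the events simply mirror the source order.
events = log_events("<b><i>x</b>y</i>")
```

An HTML5-compliant parser would instead restructure the misnested `<b>` and `<i>` into a well-formed tree, which is the part that takes so much effort to get right.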

simonw•1mo ago
I just had Codex CLI figure out where that first version ended and the new one began.

It looks to me like this is the last commit before the rewrite: https://github.com/EmilStenstrom/justhtml/tree/989b70818874d...

The commit after that is https://github.com/EmilStenstrom/justhtml/commit/7bab3d2 "radical: replace legacy TurboHTML tree/handler stack with new tokenizer + treebuilder scaffold"

It also adds this document called html5ever_port_plan.md: https://github.com/EmilStenstrom/justhtml/blob/7bab3d22c0da0...

Here's the Codex CLI transcript I used to figure this out: https://gistpreview.github.io/?53202706d137c82dce87d729263df...

vivzkestrel•1mo ago
if it isn't too much to ask, since you are already insanely familiar with the html parser semantics, can you write a postgres extension that can parse html inside postgres? use case: cleaning rss feed items while storing
EmilStenstrom•1mo ago
The license is MIT, so feel free to expand this any way you want! No need to write a new parser from scratch.
furyofantares•1mo ago
Is it really too much to do a little more editing of the LLM output for the blog post? There are 17 numbered and titled section headings, all of which are linkable with anchors, and most of which have two sentences each.
EmilStenstrom•1mo ago
Hi! Yes, the headers were LLM generated and the text was not. I didn't want the blog post to go on for ages, so I just wrote a few lines under each heading. Any ideas how to make it better, while not being too long?
furyofantares•1mo ago
I'd start by deleting all the numbered section headings, and add either a transition word (then, so) or a transition sentence (why you went from step n to step n+1 or after how much time or whatnot).
EmilStenstrom•1mo ago
New iteration up. I kept the headings because they make the text easier to scan, but made them more descriptive. Added some transition words. Slight improvement I think.
Aloisius•1mo ago
I'm not seeing 100% pass rates.

    $ uv run run_tests.py --check-errors -v

    FAILED: 8337/9404 passed (88.6%), 13 skipped

It seems the parser is creating errors even when none are expected:

    === INCOMING HTML ===
    <math><mi></mi></math>

    === EXPECTED ERRORS ===
    (none)

    === ACTUAL ERRORS ===
    (1,12): unexpected-null-character
    (1,1): expected-doctype-but-got-start-tag
    (1,11): invalid-codepoint

This "passes" because the output tree still matches the expected output, but it is clearly not correct.

The test suite also doesn't seem to be checking errors for large swaths of the html5 test suite even with --check-errors, so it's hard to say how many would pass if those were checked.

EmilStenstrom•1mo ago
Hi! The expected errors are not standardized enough for it to make sense to enable --check-errors by default. If you look at the readme, you'll see that the only thing it checks is that the _number of errors_ is correct.

That said, the example you are pulling out does not match that either. I'll make sure to fix this bug and others like it! https://github.com/EmilStenstrom/justhtml/issues/20
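To make the distinction concrete, here is a minimal sketch of a pass/fail check that always compares the output tree and only compares the error count when asked to. It is not the project's actual `run_tests.py`; `ParseResult`, `check`, and the `check_errors` parameter are hypothetical names chosen to mirror the discussion.

```python
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    """Hypothetical container for what a parser run produced."""
    tree: str
    errors: list = field(default_factory=list)


def check(result, expected_tree, expected_errors, check_errors=False):
    """Return True if the test passes under the chosen strictness."""
    if result.tree != expected_tree:
        return False
    # Only the *number* of errors is compared, not codes or positions.
    if check_errors and len(result.errors) != len(expected_errors):
        return False
    return True


# A run with the right tree but three unexpected errors.
result = ParseResult(
    tree="<math><mi>",
    errors=["unexpected-null-character",
            "expected-doctype-but-got-start-tag",
            "invalid-codepoint"],
)
lenient = check(result, "<math><mi>", [])
strict = check(result, "<math><mi>", [], check_errors=True)
```

Under these assumptions the same run passes the lenient check and fails the strict one, which is the gap being reported.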

Aloisius•1mo ago
run_tests.py does not appear to be checking the number of errors or the errors themselves for the tokenizer, encoding or serializer tests from html5lib-tests - which represent the majority of tests.

There's also something off about your benchmark comparison. If one runs pytest on html5lib, which uses html5lib-tests plus its own unit tests and does check if errors match exactly, the pass rate appears to be much higher than 86%:

    $ uv run pytest -v 
    17500 passed, 15885 skipped, 683 xfailed,

These numbers are inflated because html5lib-tests/tree-construction tests are run multiple times in different configurations. Many of the expected failures appear to be script tests similar to the ones JustHTML skips.
EmilStenstrom•1mo ago
Excellent feedback. I'll have a look at the running of html5lib tests again.
EmilStenstrom•1mo ago
I've checked the numbers for html5lib, and they are correct. They are skipping a load of tests for many different reasons, one being that namespacing of svg/math fragments is not implemented. The 88% number listed is correct.
EmilStenstrom•1mo ago
Thanks for flagging this. Found multiple errors that are now fixed:

- The quoted test comes from justhtml-tests, a custom test suite added to make sure all parts of the algorithm are tested. It is not part of html5lib-tests.

- html5lib-tests does not support control characters in tests, which is why some of the tests in justhtml-tests exist in the first place. I have added that ability to my test runner to make sure we handle control characters correctly.

- In the INCOMING HTML block above, we are not printing control characters; they get filtered away in the terminal.

- Both the treebuilder and the tokenizer are outputting errors for the found control character. Neither reports it at the right location (at flush instead of where found), and the errors are also duplicated.

- This being my own test suite, I haven't specified the correct errors. I should. expected-doctype-but-got-start-tag is reasonable in this case.

All of the above bugs are now fixed, and the test suite is in a better shape. Thanks again!
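As a side note on the invisible control characters: one way to keep them visible in test output is to escape non-printable characters before printing. This is only a sketch; `show_control_chars` is a made-up helper, and the NUL byte in the sample input is an assumption, since the actual character in the quoted test is not visible in the output above.

```python
def show_control_chars(text: str) -> str:
    """Replace non-printable characters with their Python escape sequence."""
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


# Assumed input: a NUL character inside <mi>, which a terminal would hide.
sample = "<math><mi>\x00</mi></math>"
visible = show_control_chars(sample)
```

Printing `visible` instead of `sample` shows the escaped character in place, so the reported error columns can be matched against what is actually in the input.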

JackSlateur•1mo ago
What ugly code (I read encoding.py)