Show HN: Defuddle, an HTML-to-Markdown alternative to Readability

https://github.com/kepano/defuddle
158•kepano•6h ago
Defuddle is an open-source JS library I built to parse and extract the main content and metadata from web pages. It can also return the content as Markdown.

I built Defuddle while working on Obsidian Web Clipper[1] (also MIT-licensed) because Mozilla's Readability[2] appears to be mostly abandoned, and didn't work well for many sites.

It's still very much a work in progress, but I thought I'd share it today, in light of the announcement that Mozilla is shutting down Pocket. This library could be helpful to anyone building a read-it-later app.

Defuddle is also available as a CLI:

https://github.com/kepano/defuddle-cli

[1] https://github.com/obsidianmd/obsidian-clipper

[2] https://github.com/mozilla/readability

Comments

busymom0•6h ago
In the playground, after I enter a url, I can't seem to figure out how to submit it to fetch the url? I tried pressing the return key on iOS keyboard but it didn't do anything. Am I missing something?
kepano•6h ago
The input is there to test the url option — which I admit is a bit confusing, so I have removed it for now. I haven't found a good and free way to proxy requests from a GitHub page (yet).
rcarmo•6h ago
The Python analogues seem to be well maintained. I did my own implementation of the Readability algorithm years ago and dropped it in favor of them, and I have a few scrapers going strong with regular updates.
kepano•6h ago
Are there any in particular you can recommend?
khimaros•5h ago
not parent, but this one looks maintained https://github.com/buriy/python-readability
fkfyshroglk•5h ago
For those not in the know: [Readability](https://github.com/mozilla/readability)
billconan•5h ago
Are you using AI models behind the scenes? I saw Gemini and others in the code. I am asking mainly to understand the cost of using yours vs. Readability. Thanks!
kepano•5h ago
No it's all rules-based. I think the code you're referring to is "extractors", which are website-specific rules that I'm working on to standardize the output from sites with comments threads (e.g. HN, Reddit) and conversational chats (ChatGPT, Claude, Gemini).
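To make the "extractors" idea concrete, here is a minimal sketch of how website-specific rules with a generic fallback might be dispatched. It is not taken from Defuddle's code: `Extractor`, `onDomain`, `siteExtractors`, `genericExtractor`, `hostnameOf`, and `pickExtractor` are all hypothetical names, and the hostname parsing is deliberately simplified.

```typescript
// Hypothetical sketch of rules-based extractor dispatch.
// None of these names come from Defuddle; they are made up for illustration.

interface Extractor {
  name: string;
  // True when this extractor should handle a page on the given hostname.
  matches: (hostname: string) => boolean;
}

// Matches a domain and any of its subdomains, but not look-alike domains
// (e.g. "old.reddit.com" matches "reddit.com", "notreddit.com" does not).
function onDomain(domain: string): (hostname: string) => boolean {
  const suffix = "." + domain;
  return (hostname) =>
    hostname === domain || hostname.slice(-suffix.length) === suffix;
}

// Site-specific rules are checked in order; the first match wins.
const siteExtractors: Extractor[] = [
  { name: "hackernews", matches: onDomain("news.ycombinator.com") },
  { name: "reddit", matches: onDomain("reddit.com") },
];

// Fallback used when no site-specific rule applies.
const genericExtractor: Extractor = { name: "generic", matches: () => true };

// Simplified hostname parsing (assumption: plain http/https URLs without
// userinfo); a real implementation would use a proper URL parser.
function hostnameOf(url: string): string {
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(url);
  return match && match[1] ? match[1].toLowerCase() : "";
}

function pickExtractor(url: string): Extractor {
  const hostname = hostnameOf(url);
  for (const extractor of siteExtractors) {
    if (extractor.matches(hostname)) return extractor;
  }
  return genericExtractor;
}
```

In a design like this, supporting a new site only means appending an entry to the list, and pages without a dedicated rule still go through the generic path.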
pugio•51m ago
I would love something which reliably extracted a markdown back/forth from all the main LLM providers. I tried `defuddle` on a shared Gemini URL and it returned nothing but the "Sign In" link. Maybe I'm using your extractor wrong? How are you managing to get the rendered conversation HTML?
tmpfs•5h ago
Interesting as I was researching this recently and certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project.

In the end I found the python trifatura library to extract the best quality content with accurate meta data.

You might want to compare your implementation to trifatura to see if there is room for improvement.

fabmilo•2h ago
reference to the library: https://trafilatura.readthedocs.io/en/latest/

for the curious: Trafilatura means "extrusion" in Italian.

| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. search maccheroni trafilati vs maccheroni lisci :)

(btw I think you meant trafilatura not trifatura)

thm•9m ago
Been using it since day one but development has stalled quite a bit since 2.0.0.
acrophobic•15m ago
> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.

[0]: https://github.com/go-shiori/go-readability

[1]: https://github.com/markusmobius/go-trafilatura

creakingstairs•4h ago
I was just looking at Obsidian Web Clipper's source code because I've been quite impressed by its Markdown conversion results, and came across Defuddle in there. I'll be using it for my bespoke read-it-later/knowledge-base app, so thank you in advance :D
input_sh•4h ago
A bit off-topic, but I'm very excited to see the launch of Bases! I've obsessively followed the roadmap for like a year awaiting this day and have been frequently disappointed to still see it stuck somewhere under "planned".

Not that I didn't already implement a read-it-later solution with Obsidian+Dataview, but this definitely makes things simpler!

jeanlucas•2h ago
Didn't it release just some days ago?
sn0n•49s ago
Bases?
inhumantsar•4h ago
can confirm that readability seems to be on life support. I used it in slurp, an obsidian plugin which serves the same basic purpose as web clipper, and always had a hard time getting PRs reviewed and merged.

i started working on my own alternative but life (and web clipper) derailed the work.

it's funny. somehow slurp keeps gaining new users even though web clipper exists. so i might have to refactor it to use your library sometime soon even though I don't use slurp myself anymore.

90s_dev•3h ago
Neat. With ~3 more lines of code, you could get a URL and render it in simpler HTML and be a full-fledged replacement.
khaki54•2h ago
seems pretty much perfect including obsidian clipper. Thanks!
shrinks99•2h ago
I've been super happy with Obsidian Web Clipper! It's worked really well for me with the one exception of importing publish dates (which is more than forgivable!)
jeanlucas•2h ago
Obsidian Web Clipper is a great tool to turn ChatGPT conversations into Markdown, or to just print them (believe me, it is a use case)
T0Bi•1h ago
I just ask ChatGPT to provide the summary or whatever I need as a markdown file.
acrophobic•1h ago
Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was just 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.
Tsarp•39m ago
Been using the Obsidian clipper since it came out and this is really neat. The per-website, profile-based extraction is awesome.

Even if you are not an Obsidian user, the Markdown extraction quality is the most reliable I've seen.
