Show HN: Defuddle, an HTML-to-Markdown alternative to Readability

https://github.com/kepano/defuddle
418•kepano•10mo ago
Defuddle is an open-source JS library I built to parse and extract the main content and metadata from web pages. It can also return the content as Markdown.

I built Defuddle while working on Obsidian Web Clipper[1] (also MIT-licensed) because Mozilla's Readability[2] appears to be mostly abandoned, and didn't work well for many sites.

It's still very much a work in progress, but I thought I'd share it today, in light of the announcement that Mozilla is shutting down Pocket. This library could be helpful to anyone building a read-it-later app.

Defuddle is also available as a CLI:

https://github.com/kepano/defuddle-cli

[1] https://github.com/obsidianmd/obsidian-clipper

[2] https://github.com/mozilla/readability

Comments

busymom0•10mo ago
In the playground, after I enter a url, I can't seem to figure out how to submit it to fetch the url? I tried pressing the return key on iOS keyboard but it didn't do anything. Am I missing something?
kepano•10mo ago
The input is there to test the url option — which I admit is a bit confusing, so I have removed it for now. I haven't found a good and free way to proxy requests from a GitHub page (yet).
rcarmo•10mo ago
The Python analogues seem to be well maintained. I did my own implementation of the Readability algorithm years ago and dropped it in favor of them, and I have a few scrapers going strong with regular updates.
kepano•10mo ago
Are there any in particular you can recommend?
khimaros•10mo ago
not parent, but this one looks maintained https://github.com/buriy/python-readability
fkfyshroglk•10mo ago
For those not in the know: [Readability](https://github.com/mozilla/readability)
billconan•10mo ago
Are you using AI models behind the scenes? I saw Gemini and others in the code. I am asking mainly to understand the cost of using yours vs. Readability. Thanks!
kepano•10mo ago
No it's all rules-based. I think the code you're referring to is "extractors", which are website-specific rules that I'm working on to standardize the output from sites with comments threads (e.g. HN, Reddit) and conversational chats (ChatGPT, Claude, Gemini).
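A minimal sketch of how website-specific rules like these could be dispatched by hostname. Everything here is an assumption for illustration: the `Message`, `ExtractorRule`, and `pickExtractor` names are hypothetical and are not taken from Defuddle's code.

```typescript
// Hypothetical shapes for illustration; these are not Defuddle's real types.
interface Message {
  author: string;
  text: string;
}

type Extractor = (html: string) => Message[];

interface ExtractorRule {
  // Hostnames this rule applies to (exact match or any subdomain).
  hosts: string[];
  extract: Extractor;
}

// Returns true when `hostname` equals `host` or is a subdomain of it.
function hostMatches(hostname: string, host: string): boolean {
  return hostname === host || hostname.endsWith("." + host);
}

// Picks the first rule whose host list matches the URL. Returns undefined
// when the URL is invalid or no site-specific rule exists, so the caller
// can fall back to generic extraction.
function pickExtractor(url: string, rules: ExtractorRule[]): Extractor | undefined {
  let hostname: string;
  try {
    hostname = new URL(url).hostname;
  } catch {
    return undefined;
  }
  const rule = rules.find((r) => r.hosts.some((h) => hostMatches(hostname, h)));
  return rule?.extract;
}

const rules: ExtractorRule[] = [
  {
    hosts: ["news.ycombinator.com"],
    // Toy rule: a real one would walk the comment tree in the DOM.
    extract: () => [{ author: "example", text: "HN comment" }],
  },
  {
    hosts: ["reddit.com"],
    extract: () => [{ author: "example", text: "Reddit comment" }],
  },
];
```

Calling `pickExtractor("https://old.reddit.com/r/x", rules)` would select the Reddit rule through the subdomain match, while an unlisted site yields `undefined` and falls through to the generic path.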
pugio•10mo ago
I would love something which reliably extracted a markdown back/forth from all the main LLM providers. I tried `defuddle` on a shared Gemini URL and it returned nothing but the "Sign In" link. Maybe I'm using your extractor wrong? How are you managing to get the rendered conversation HTML?
bambax•10mo ago
I think most LLM APIs return markdown and the conversion md->html happens after; so if you query the API directly you get markdown "for free".
tmpfs•10mo ago
Interesting as I was researching this recently and certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project.

In the end I found the python trifatura library to extract the best quality content with accurate meta data.

You might want to compare your implementation to trifatura to see if there is room for improvement.

fabmilo•10mo ago
reference to the library: https://trafilatura.readthedocs.io/en/latest/

for the curious: Trafilatura means "extrusion" in Italian.

| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. search maccheroni trafilati vs maccheroni lisci :)

(btw I think you meant trafilatura not trifatura)

thm•10mo ago
Been using it since day one but development has stalled quite a bit since 2.0.0.
acrophobic•10mo ago
> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.

[0]: https://github.com/go-shiori/go-readability

[1]: https://github.com/markusmobius/go-trafilatura

breadchris•10mo ago
this is what i came here to see, thanks!
derekperkins•10mo ago
We've been active users of go-trafilatura and love it
winddude•10mo ago
It's a bit old, but I benchmarked a number of the web extraction tools years ago, https://github.com/Nootka-io/wee-benchmarking-tool, resiliparse-plain was my clear winner at the time.
creakingstairs•10mo ago
I was just looking at Obsidian Web Clipper's source code because I've been quite impressed at its markdown conversion results and came across Defuddle in there. I'll be using it for my bespoke read-it-later/knowledge-base app, so thank you in advance :D
input_sh•10mo ago
A bit off-topic, but I'm very excited to see the launch of Bases! I've obsessively followed the roadmap for like a year awaiting this day and have been frequently disappointed to still see it stuck somewhere under "planned".

Not that I didn't already implement a read-it-later solution with Obsidian+Dataview, but this definitely makes things simpler!

jeanlucas•10mo ago
Didn't it release just some days ago?
sn0n•10mo ago
Bases?
input_sh•10mo ago
https://help.obsidian.md/bases

Note that I'm using a preview (catalyst) version, it will reach stable soon. I'm assuming kepano will submit it here then.

inhumantsar•10mo ago
can confirm that readability seems to be on life support. I used it in slurp, an obsidian plugin which serves the same basic purpose as web clipper, and always had a hard time getting PRs reviewed and merged.

i started working on my own alternative but life (and web clipper) derailed the work.

it's funny. somehow slurp keeps gaining new users even though web clipper exists. so i might have to refactor it to use your library sometime soon even though I don't use slurp myself anymore.

90s_dev•10mo ago
Neat. With ~3 more lines of code, you could get a URL and render it in simpler HTML and be a full fledged replacement.
khaki54•10mo ago
seems pretty much perfect including obsidian clipper. Thanks!
shrinks99•10mo ago
I've been super happy with Obsidian Web Clipper! It's worked really well for me with the one exception of importing publish dates (which is more than forgivable!)
jeanlucas•10mo ago
Obsidian Web Clipper is a great tool to turn ChatGPT conversations into Markdown, or to just print them (believe me, it is a use case)
T0Bi•10mo ago
I just ask ChatGPT to provide the summary or whatever I need as a markdown file.
emaro•10mo ago
Not sure about other clients, but Kagi Assistant directly offers to save a conversation as Markdown. Using Obsidian's web-clipper is a good idea too though.
kouru225•10mo ago
Is that a paid plugin?
jeanlucas•10mo ago
It is free and open source: https://github.com/obsidianmd/obsidian-clipper
acrophobic•10mo ago
Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was just 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.
khasan222•10mo ago
That codebase definitely leaves much to be desired, I’ve already had to fork it for work in order to fix some bugs.

One such bug: take a page in a language that uses commas instead of periods between numbers, like Dutch (I think), with a lot of prices on it. It’ll think all the numbers are relevant text.

And of course I tried to open a PR and get it merged, but they require tests, and of course the tests don’t work on the page I’m testing. It’s just very snafu imho
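To illustrate the kind of failure described above, here is a hedged sketch that assumes a scorer treating every comma as evidence of running prose. It is a simplified stand-in and not Readability's actual code; both function names are hypothetical.

```typescript
// Naive signal: every comma counts as evidence of running prose.
function countCommasNaive(text: string): number {
  return (text.match(/,/g) ?? []).length;
}

// Variant that skips commas sitting directly between two digits, which in
// many locales are decimal separators ("12,50") rather than punctuation.
// A comma is counted if it is not preceded by a digit, or not followed by one.
function countProseCommas(text: string): number {
  return (text.match(/(?<!\d),|,(?!\d)/g) ?? []).length;
}

const priceList = "€ 12,50 € 3,99 € 104,00";
const sentence = "First, we parse the page, then we score it.";

console.log(countCommasNaive(priceList), countProseCommas(priceList));
console.log(countCommasNaive(sentence), countProseCommas(sentence));
```

The trade-off is that thousands separators such as `1,000` are skipped as well, which seems acceptable for a signal that is only meant to measure prose.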

fabrice_d•10mo ago
This seems to be https://github.com/mozilla/readability/pull/853#issuecomment... and I think their expectations are pretty reasonable.
khasan222•10mo ago
Meh, maybe I'm standing too close to the problem, Idk. It is always frustrating trying to use a tool and having it not work, though. I know it's free and all, but then I feel like helping people make good contributions is paramount in maintaining and fixing bugs.

Clearly the comma thing is a bug, it's the lack of wanting to fix it actually that is a bit disheartening, and why I think it is a deadish repo

fabrice_d•10mo ago
I don't know how you can interpret "we'd really like to make sure that the patch works and that we don't break it in the future" as "lack of wanting to fix it", but you do you.
Tsarp•10mo ago
Been using the Obsidian clipper since it was out and this is really neat. The per-website, profile-based extraction is awesome.

Even if you are not an Obsidian user, the markdown extraction quality is the most reliable I've seen.

audessuscest•10mo ago
thanks for the tip!
jonplackett•10mo ago
Does anyone know why readers don’t work for some websites where it looks like they should, i.e. a normal article with lots of text?

You just get a completely white page (on the iPhone reader). Usually it’s a news website.

Is this the website intentionally obscuring the content to ensure they can serve their ads? If so how do they go about it?

miki123211•10mo ago
Cookie and "we care about your privacy" banners are often the cause here, especially if you're in the EU / UK / possibly California[1].

On some websites, those are just modals that obscure the content, something that reader mode can usually deal with just fine, but on others, they're implemented as redirects or rendered server-side.

If reader mode doesn't work, dismiss those first and try again.

revskill•10mo ago
Interesting that Markdown does not support form elements.
Andr2Andr•10mo ago
Serious question - who would be using this tool, and why? What is the use case? In other comments I have only seen exporting ChatGPT conversations to md
degosuke•10mo ago
I use LogSeq a lot - and having the option to scrape a website with only the text in MD seems like a great fit.
rollcat•10mo ago
This is a library, not a tool. You can use it for a number of purposes:

- Providing "reader mode" for your visitors

- Using it in a browser extension to add reader mode

- Scraping

- Plugging it into a [reverse] proxy that automatically removes unnecessary bloat from pages, for e.g. easier access on retro hardware <https://web.archive.org/web/20240621144514/https://humungus....> (archive.org link, because the website goes down regularly)

timdeve•10mo ago
Looks good, I'm gonna try to swap readability in my RSS reader with this.

And with Pocket going away I might have to add save it later to it...

ulrischa•10mo ago
I have built something similar: https://devkram.de/markydown but with PHP. Easy for self-hosting
ioma8•10mo ago
Tried it on some webpages, doesn't work well.
severusdd•10mo ago
This is very cool! Given how messy and busy many websites have become, we really need a robust markdown converter that lets readers focus on reading the content. Nice to see something stepping up where Readability left off.

Thank you for picking up this work :-)

ricardonunez•10mo ago
I’ll give it a try. I’m not happy with my current setup for markdown to HTML on the wysiwyg editor I’m using, this may provide better results if I go with my own tool bar and editor.
binarymax•10mo ago
Really nice work. I appreciate the example with JSDOM as that’s exactly how I use readability, and this looks like a nice drop-in replacement.

Question: How did you validate this? You say it works better than readability but I don’t see any tests or datasets in the repo to evaluate accuracy or coverage. Would it be possible to share that as well?

kepano•10mo ago
Currently I am relying on manual testing and user feedback, but yes, I'd like to add tests.

Defuddle works quite differently from Readability. Readability tends to be overly conservative and tends to remove useful content because it tests blocks to find the beginning and end of the "main" content.

Defuddle is able to run multiple passes and detect if it returned no content to try and expand its results. It also uses a greater variety of techniques to clean the content — for example, by using a page's mobile styles to detect content that can be hidden.

Lastly, Defuddle is not only extracting the content but also standardizing the output (which Readability doesn't do). For example footnotes and code blocks all aim to output a single format, whereas Readability keeps the original DOM intact.
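The multi-pass idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Defuddle's implementation: `PassOptions`, `ExtractFn`, `extractWithRetries`, and the toy extractor are all hypothetical names.

```typescript
// Hypothetical options and extractor signature, for illustration only.
interface PassOptions {
  removeLowScoring: boolean;
  removeHidden: boolean;
}

type ExtractFn = (html: string, options: PassOptions) => string;

// Tries each option set in order, from strictest to most permissive, and
// returns the first result that contains at least `minWords` words.
function extractWithRetries(
  html: string,
  extract: ExtractFn,
  passes: PassOptions[],
  minWords = 1,
): { content: string; passIndex: number } {
  let last = "";
  for (let i = 0; i < passes.length; i++) {
    last = extract(html, passes[i]);
    const words = last.split(/\s+/).filter(Boolean).length;
    if (words >= minWords) {
      return { content: last, passIndex: i };
    }
  }
  // Every pass came back short: return the final attempt as-is.
  return { content: last, passIndex: passes.length - 1 };
}

// Toy extractor: the strict setting throws everything away, the relaxed
// setting strips tags and keeps the text.
const toyExtract: ExtractFn = (html, options) =>
  options.removeLowScoring ? "" : html.replace(/<[^>]+>/g, " ").trim();

const outcome = extractWithRetries("<p>hello world</p>", toyExtract, [
  { removeLowScoring: true, removeHidden: true },
  { removeLowScoring: false, removeHidden: true },
]);
console.log(outcome);
```

Ordering the passes from strict to permissive keeps the cleanest output when it is available and only pays for extra passes on pages where the first attempt returns nothing.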

honodk123•10mo ago
This looks great!

I would love to give Defuddle a try as a Readability replacement. However, for my use case I want to do it in a Chrome extension background script (service worker). I have not been able to get Defuddle to work, while Readability does (when combined with linkedom). So basically, while this works:

  import { parseHTML } from 'linkedom';
  ...
  private extractArticleWithReadability(html: string) {
      const { document } = parseHTML(html);
      const reader = new Readability(document);
      return reader.parse();
  }

This does not:

  import { parseHTML } from 'linkedom';
  ...
  private async extractArticleWithDefuddle(html: string) {
      const { document } = parseHTML(html);
      const result = new Defuddle(document);
      result.parse();
      return result;
  }

I get errors like:

- Error in findExtractor: TypeError: Failed to construct 'URL': Invalid URL

- Defuddle: Error evaluating media queries: TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))

- Defuddle Error processing document: TypeError: b.getComputedStyle is not a function

Is there a way to run Defuddle in a chrome extension background script/service worker? Or do you have any plans of adding support for that?
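One way to narrow down errors like these is to check which DOM capabilities the parsed document exposes before handing it to a library. The sketch below is a guess at what the three messages point to (a usable `document.URL`, `getComputedStyle` on the window object, and an iterable `document.styleSheets`); that mapping is an assumption and has not been checked against Defuddle's source. `missingDomFeatures` and the `*Like` types are hypothetical.

```typescript
// Minimal structural types so the check runs without DOM typings.
interface WindowLike {
  getComputedStyle?: unknown;
}

interface DocumentLike {
  URL?: string;
  defaultView?: WindowLike | null;
  styleSheets?: unknown;
}

// Lists the capabilities, as guessed from the error messages, that the
// given document object appears to lack.
function missingDomFeatures(doc: DocumentLike): string[] {
  const missing: string[] = [];

  // "Failed to construct 'URL'": the document may have no absolute URL.
  try {
    new URL(doc.URL ?? "");
  } catch {
    missing.push("document.URL is not a valid absolute URL");
  }

  // "getComputedStyle is not a function": no layout/style engine.
  if (typeof doc.defaultView?.getComputedStyle !== "function") {
    missing.push("window.getComputedStyle is not a function");
  }

  // "undefined is not iterable": assumed here to be the style sheet list.
  const sheets = doc.styleSheets as { [Symbol.iterator]?: unknown } | null | undefined;
  if (typeof sheets?.[Symbol.iterator] !== "function") {
    missing.push("document.styleSheets is not iterable");
  }

  return missing;
}

console.log(missingDomFeatures({}));
```

Passing the `document` returned by a parser and logging the result shows at a glance which of the three gaps apply in a given environment.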

ahsd1•10mo ago
Cool. Im looking for something similar but for stripping signatures and boilerplate disclaimers from html email. Could this work for that?
infogulch•10mo ago
Since it's written in javascript is there any chance it could be packaged as a bookmarklet?
miketromba•10mo ago
Excellent work. A modern alternative to readability was much needed. This is especially useful for building clean web context for LLMs. Thanks for open-sourcing this!
elcritch•10mo ago
I found LLMs are really good at taking a web page and transforming it to markdown. Well rather commercial LLMs like Claude and Gemini are.

Unfortunately I tried a bunch of Hugging Face models that I could run on my MacBook and all of them ignored my prompts despite trying every variation I could think of. Half the time they just tried summarizing it and describing what JavaScript was. :/

novoreorx•10mo ago
I built a similar project called Substance [^0]. Unlike most readability tools that try to solve the problem once and for all, it takes a different approach. It provides a framework to define how each website should be handled, ensuring better results for each website covered.

[^0]: https://substance.reorx.com/
