Humans get HTML, bots get markdown. Two tiny tweaks I’d make:
- Send Vary: Accept so caches don’t mix Markdown and HTML.
- Expose it with a Link: …; rel="alternate"; type="text/markdown" so it’s easy to discover.
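
For anyone who wants to see those two tweaks in one place, here is a rough sketch of the server side in Python. It is not taken from the article: `parse_accept`, `negotiate`, and the `/post.md` URL are made-up names for illustration, and the Accept parsing is deliberately simplified (it ignores wildcards like `*/*`).

```python
def parse_accept(header: str) -> dict[str, float]:
    """Parse an Accept header into {media_type: q}. Simplified: no wildcard handling."""
    prefs: dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not wanted"
        prefs[media_type] = q
    return prefs


def negotiate(accept: str, html_body: str, markdown_body: str, markdown_url: str):
    """Return (headers, body) for one page, picking Markdown only when it is preferred."""
    prefs = parse_accept(accept or "")
    wants_markdown = prefs.get("text/markdown", 0.0) > prefs.get("text/html", 0.0)

    # Vary: Accept goes on every response so a shared cache keeps the variants apart.
    headers = {"Vary": "Accept"}
    if wants_markdown:
        headers["Content-Type"] = "text/markdown; charset=utf-8"
        return headers, markdown_body

    headers["Content-Type"] = "text/html; charset=utf-8"
    # Advertise the Markdown variant so clients can discover it from the HTML response.
    headers["Link"] = f'<{markdown_url}>; rel="alternate"; type="text/markdown"'
    return headers, html_body


if __name__ == "__main__":
    print(negotiate("text/markdown", "<p>hi</p>", "hi", "/post.md"))
```

A typical browser Accept header never mentions `text/markdown`, so browsers keep getting HTML plus the Link header.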
You can always do the markdown -> DOM conversion on the client. Sure, there's a bit of latency there, but it means easier deployment (no build step involving pandoc or similar).
Browser-native markdown support would be better, though; you'd get the ability to do proper contenteditable divs with bold, italic, etc. done via markdown.
It can. Unlikely but possible. A good first step would be to have a well-written web component to be used like this: `<markdown>...</markdown>`, with no support at all for a build-step. The .js file implementing this should be included directly in the `<head>`.
If that gets traction (unlikely, but possible) then the standards would sooner or later introduce a tag native to the browser that does the same thing.
that's awesome. i love this line.
As of last week, impressions have also dropped. Maybe that is the result of people no longer clicking on my links?
If a customer asks the AI what product can solve their problem and it replies with our product, that is a huge win.
If your business is SEO spam with online ads, chatgpt might eat it. But if your business is selling some product, chatgpt might help you sell it.
What makes you think this?
The economic dynamics did not change and the methods will adapt.
Why wouldn't Google sell advertisers a prominent spot in the AI summary? That's their whole deal. Why wouldn't OpenAI do the same with (free) users?
Yes, SEO can bring traffic to your site, but if your visitors see nothing of value, they'll quickly leave.
> The best refrigerator on the market varies based on individual needs, but top brands like LG and Samsung are highly recommended for their innovative features, reliability, and energy efficiency. For specific models, consider LG's Smart Standard-Depth MAX™ French Door Refrigerator or Samsung's smart refrigerators with internal cameras.
Optimizing your site for LLMs means that you can direct their gestalt thinking towards your brand.
Yes, for prompts. Given how little XML is out on the public internet, it'd be surprising if it also applies to data ingestion from web scraping functions. It'd be odd if Markdown works better than HTML, to be honest, but maybe Markdown also changes the content being served, e.g. there's no menu, header, or footer sent with the body content.
Also, I doubt most large-scale scrapers are running in agent loops with tool calls, so this is probably necessary for those at a minimum.
It seems “obvious” to me that if you have a tool which can request a web page, you can make it so that this tool extracts the main content from the page’s HTML. Maybe there is something I’m missing here that makes this more difficult for LLMs, because before we had LLMs, this was considered an easy problem. It is surprising to me that the addition of LLMs has made this previously easy, efficient solution somehow unviable or inefficient.
I think we should also assume here that the web site is designed to be scraped this way—if you don’t, then “Accept: text/markdown” won’t work.
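
To make the client side of that assumption concrete, here is a small sketch using only the standard library. The function names and the q-value are my own choices, not anything from the article: the tool asks for Markdown first, then checks the response's Content-Type to see whether the site actually honored it before skipping HTML extraction.

```python
import urllib.request


def build_markdown_request(url: str) -> urllib.request.Request:
    """Build a request that prefers Markdown but still accepts HTML as a fallback."""
    return urllib.request.Request(
        url,
        headers={"Accept": "text/markdown, text/html;q=0.8"},
    )


def is_markdown(content_type: str) -> bool:
    """True if a Content-Type header value names text/markdown (parameters ignored)."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/markdown"


if __name__ == "__main__":
    req = build_markdown_request("https://example.com/post")
    print(req.get_header("Accept"))
    # After urllib.request.urlopen(req), a caller would check
    # is_markdown(response.headers.get("Content-Type", "")) and fall back
    # to HTML extraction when it returns False.
```

If the site ignores the header, you get HTML back and are exactly where you were before, which is why the fallback still matters.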
If your agent sucks so bad that it isn’t capable of consuming HTML without tokenizing the whole damn thing, wouldn’t you just use an agent that isn’t such a mess?
This whole thing kinda sounds crazy inefficient to me.
There is no real reason to pass HTML with tags and all to the LLM - you can just strip the tags beforehand.
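
A minimal sketch of what "strip the tags beforehand" can look like with Python's built-in `html.parser`. The class name and the list of skipped elements are my own assumptions, and this is far cruder than Readability-style main-content extraction; it only drops markup and a few obviously non-content elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping a few elements that rarely hold main content."""

    SKIP = {"script", "style", "nav", "header", "footer"}  # assumed list, adjust as needed

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skip_depth = 0  # > 0 while inside any skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = "<body><nav><a href='/'>Home</a></nav><p>Hello <b>world</b></p></body>"
    print(strip_tags(page))
```

This loses structure such as headings and links, which is one argument for converting to Markdown instead of plain text.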
1. The Jina reader API - https://jina.ai/reader/ - prepend r.jina.ai to any URL to run it through their hosted conversion proxy, e.g. https://r.jina.ai/www.skeptrune.com/posts/use-the-accept-hea...
2. Applying Readability.js and Turndown via Playwright. Here's a shell script that does that using my https://shot-scraper.datasette.io tool: https://gist.github.com/simonw/82e9c5da3f288a8cf83fb53b39bb4...
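
For option 1, the URL rewrite is simple enough to sketch. This mirrors the shape of the example URL in the comment (scheme dropped, host and path appended to the prefix); the helper name is hypothetical, and whether the service expects the scheme to be kept or dropped is an assumption here, so check their docs before relying on it.

```python
READER_PREFIX = "https://r.jina.ai/"


def reader_url(url: str) -> str:
    """Rewrite a page URL into the r.jina.ai proxy form shown in the comment above."""
    # Assumption: drop the scheme so the result matches the example's shape.
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return READER_PREFIX + url


if __name__ == "__main__":
    print(reader_url("https://example.com/posts/hello"))
```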
This is much cheaper to run on a server. For example: https://github.com/ozanmakes/scrapedown
https://toffelblog.xyz/blog/gemini-overview/ https://news.ycombinator.com/item?id=23730408
https://gemini.circumlunar.space/ https://news.ycombinator.com/item?id=23042424
skeptrune•4mo ago
- https://x.com/bunjavascript/status/1971934734940098971
- https://x.com/thdxr/status/1972421466953273392
- https://x.com/mintlify/status/1972315377599447390
hahnbee•4mo ago