frontpage.

155M US land parcel boundaries

https://www.kaggle.com/datasets/landrecordsus/us-parcel-layer
1•tjwebbnorfolk•1m ago•0 comments

Private Inference

https://confer.to/blog/2026/01/private-inference/
1•jbegley•5m ago•0 comments

Font Rendering from First Principles

https://mccloskeybr.com/articles/font_rendering.html
1•krapp•8m ago•0 comments

Show HN: Seedance 2.0 AI video generator for creators and ecommerce

https://seedance-2.net
1•dallen97•12m ago•0 comments

Wally: A fun, reliable voice assistant in the shape of a penguin

https://github.com/JLW-7/Wally
1•PaulHoule•13m ago•0 comments

Rewriting Pycparser with the Help of an LLM

https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/
1•y1n0•15m ago•0 comments

Lobsters Vibecoding Challenge

https://gist.github.com/MostAwesomeDude/bb8cbfd005a33f5dd262d1f20a63a693
1•tolerance•15m ago•0 comments

E-Commerce vs. Social Commerce

https://moondala.one/
1•HamoodBahzar•16m ago•1 comments

Avoiding Modern C++ – Anton Mikhailov [video]

https://www.youtube.com/watch?v=ShSGHb65f3M
2•linkdd•17m ago•0 comments

Show HN: AegisMind–AI system with 12 brain regions modeled on human neuroscience

https://www.aegismind.app
2•aegismind_app•21m ago•1 comments

Zig – Package Management Workflow Enhancements

https://ziglang.org/devlog/2026/#2026-02-06
1•Retro_Dev•22m ago•0 comments

AI-powered text correction for macOS

https://taipo.app/
1•neuling•26m ago•1 comments

AppSecMaster – Learn Application Security with hands on challenges

https://www.appsecmaster.net/en
1•aqeisi•27m ago•1 comments

Fibonacci Number Certificates

https://www.johndcook.com/blog/2026/02/05/fibonacci-certificate/
1•y1n0•29m ago•0 comments

AI Overviews are killing the web search, and there's nothing we can do about it

https://www.neowin.net/editorials/ai-overviews-are-killing-the-web-search-and-theres-nothing-we-c...
3•bundie•34m ago•1 comments

City skylines need an upgrade in the face of climate stress

https://theconversation.com/city-skylines-need-an-upgrade-in-the-face-of-climate-stress-267763
3•gnabgib•34m ago•0 comments

1979: The Model World of Robert Symes [video]

https://www.youtube.com/watch?v=HmDxmxhrGDc
1•xqcgrek2•39m ago•0 comments

Satellites Have a Lot of Room

https://www.johndcook.com/blog/2026/02/02/satellites-have-a-lot-of-room/
2•y1n0•39m ago•0 comments

1980s Farm Crisis

https://en.wikipedia.org/wiki/1980s_farm_crisis
4•calebhwin•40m ago•1 comments

Show HN: FSID - Identifier for files and directories (like ISBN for Books)

https://github.com/skorotkiewicz/fsid
1•modinfo•45m ago•0 comments

Show HN: Holy Grail: Open-Source Autonomous Development Agent

https://github.com/dakotalock/holygrailopensource
1•Moriarty2026•52m ago•1 comments

Show HN: Minecraft Creeper meets 90s Tamagotchi

https://github.com/danielbrendel/krepagotchi-game
1•foxiel•59m ago•1 comments

Show HN: Termiteam – Control center for multiple AI agent terminals

https://github.com/NetanelBaruch/termiteam
1•Netanelbaruch•1h ago•0 comments

The only U.S. particle collider shuts down

https://www.sciencenews.org/article/particle-collider-shuts-down-brookhaven
2•rolph•1h ago•1 comments

Ask HN: Why do purchased B2B email lists still have such poor deliverability?

1•solarisos•1h ago•3 comments

Show HN: Remotion directory (videos and prompts)

https://www.remotion.directory/
1•rokbenko•1h ago•0 comments

Portable C Compiler

https://en.wikipedia.org/wiki/Portable_C_Compiler
2•guerrilla•1h ago•0 comments

Show HN: Kokki – A "Dual-Core" System Prompt to Reduce LLM Hallucinations

1•Ginsabo•1h ago•0 comments

Software Engineering Transformation 2026

https://mfranc.com/blog/ai-2026/
1•michal-franc•1h ago•0 comments

Microsoft purges Win11 printer drivers, devices on borrowed time

https://www.tomshardware.com/peripherals/printers/microsoft-stops-distrubitng-legacy-v3-and-v4-pr...
4•rolph•1h ago•1 comments

Turn any website into an API

https://www.parse.bot
105•pcl•6mo ago

Comments

runningmike•6mo ago
Nice idea. In practice, many sites have different methods to prevent scraping. There's a large risk in doing things manually, IMHO.
renegat0x0•6mo ago
Huh, I have been working on a solution to that problem.

My project lets you define rules for various sites, so eventually everything is scraped correctly. For YouTube, yt-dlp is also used to augment results.

I can crawl using requests, Selenium, httpx, and others. The response is JSON, so it's easy to process.

The downside is that it may not be the fastest solution, and I have not tested it against proxies.

https://github.com/rumca-js/crawler-buddy
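Roughly, a rule is just "domain -> fetcher + selectors". A minimal Python sketch of the idea (illustrative only, not the project's actual config format; the domain and selectors below are made up):

    # Illustrative per-site rules: domain -> which fetcher to use + CSS selectors.
    import json
    import requests
    from bs4 import BeautifulSoup

    RULES = {
        "example.com": {                    # made-up site entry
            "fetcher": "requests",          # could also be "selenium", "httpx", ...
            "fields": {
                "title": "h1.article-title",
                "date": "time.published",
            },
        },
    }

    def scrape(url, domain):
        rule = RULES[domain]
        html = requests.get(url, timeout=30).text   # dispatch on rule["fetcher"] in a real setup
        soup = BeautifulSoup(html, "html.parser")
        record = {}
        for name, selector in rule["fields"].items():
            el = soup.select_one(selector)
            record[name] = el.get_text(strip=True) if el else None
        return json.dumps(record)                   # JSON out, so it's easy to process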

with•6mo ago
pretty cool idea. using stagehand under the hood?
vin047•6mo ago
No pricing information on the site.
thrdbndndn•6mo ago
I scrape website content regularly (usually as one-offs) and have a hand-crafted extractor template where I just fill in a few arguments (mainly CSS selectors and some options) to get it working quickly. These days, I do sometimes ask AI to do this for me by giving it the HTML.
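Something like this, as a rough sketch (requests + BeautifulSoup; the selectors and field names are placeholders, not from any real site):

    # Reusable extractor template: fill in a URL, an item selector, and
    # per-field selectors/options. All selectors here are placeholders.
    import requests
    from bs4 import BeautifulSoup

    def extract(url, item_selector, text_fields, attr_fields=None):
        """text_fields: name -> CSS selector; attr_fields: name -> (selector, attribute)."""
        attr_fields = attr_fields or {}
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        rows = []
        for item in soup.select(item_selector):
            row = {}
            for name, selector in text_fields.items():
                el = item.select_one(selector)
                row[name] = el.get_text(strip=True) if el else None
            for name, (selector, attribute) in attr_fields.items():
                el = item.select_one(selector)
                row[name] = el.get(attribute) if el else None
            rows.append(row)
        return rows

    # Usage: extract("https://example.com/news", "article.post",
    #                {"title": "h2", "summary": "p.lead"}, {"link": ("a", "href")})

Filling in those few arguments is usually the only thing that changes between sites.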

The issue is that for any serious use of this concept, some manual adjustment is almost always needed. This service says, "Refine your scraper at any time by chatting with the AI agent," but from what I can tell, you can't actually see the code it generates.

Relying solely on the results and asking the AI to tweak them can work, but often the output is too tailored to a specific page and fails to generalize (essentially "overfitting"). And surprisingly, this back-and-forth can be more tedious and time-consuming than just editing a few lines of code yourself. Also, if you can't directly edit the code behind the scenes, there are situations where you'll never be able to get the exact result you want, no matter how much you try to explain it to the AI in natural language.

throwup238•6mo ago
I’ve had no shortage of trouble using LLMs for scrapers because for some reason they almost always ignore my instructions to use something other than the class name for selectors. They love to use the hashed class names (from Emotion, styled-components, or whatever CSS-in-JS library du jour) that change way too often.
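For illustration, the difference between a hashed-class selector and a more durable attribute-based one (the markup and selectors here are made up):

    # Made-up markup to show the difference; only the selection strategy matters.
    from bs4 import BeautifulSoup

    html = '<span data-testid="price" itemprop="price" class="css-1q8vga2">42.00</span>'
    soup = BeautifulSoup(html, "html.parser")

    fragile = soup.select_one(".css-1q8vga2")           # hashed CSS-in-JS class, rotates on redeploys
    durable = soup.select_one('[data-testid="price"]')  # semantic attribute, survives restyling
    print(fragile.text, durable.text)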
websiteapi•6mo ago
I'm surprised (and could be wrong) that no one has made a Chrome extension that just controls a page and exposes the output to localhost for consumption as an API. Similar to using the Chrome WebDriver, but without the setup.
ExxKA•6mo ago
Isn't that basically what browser-use is?
kevindamm•6mo ago
I kind of agree and don't. You could say HTTP+DOM is the API; we're already there. But it lacks the structure and the more explicit regularity (in part because it's meant for human consumption, not programming). And if you were to describe the whole protocol (including CSS and JS, since they can change the ordering, even the content, of what's shown), it's vastly more complicated than the equivalent, distilled representation.

There are efforts going back at least fifteen years to extract ontologies from natural language [0] and HTML structure [1].

[0]: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d... (2010) [PDF]

[1]: https://doi.org/10.1016/j.dss.2009.02.011 (2009)

meatjuice•6mo ago
It's not a browser extension, but controlling the actual browser without using webdriver is already a thing.

https://github.com/autoscrape-labs/pydoll
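For illustration, a bare-bones version of that idea is just talking to Chrome's DevTools Protocol over its debugging port (a generic sketch, not pydoll's API; it assumes Chrome was started with --remote-debugging-port=9222 and that requests and websocket-client are installed):

    # Minimal CDP round-trip: fetch the rendered HTML of the first open tab.
    import json
    import requests
    from websocket import create_connection  # pip install websocket-client

    tabs = requests.get("http://localhost:9222/json", timeout=5).json()
    ws = create_connection(tabs[0]["webSocketDebuggerUrl"])
    ws.send(json.dumps({
        "id": 1,
        "method": "Runtime.evaluate",
        "params": {"expression": "document.documentElement.outerHTML",
                   "returnByValue": True},
    }))
    html = json.loads(ws.recv())["result"]["result"]["value"]
    ws.close()
    print(len(html), "bytes of rendered HTML")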

_1tem•6mo ago
Way too little information on the homepage. Does this handle pagination? What about sites behind authentication? I assume the generated API is stable, i.e. the shape of the JSON will not change after a scraper is built, but if the site changes its DOM, does the scraper need to be regenerated? Does this attempt to defeat anti-bot and anti-scraper walls like Cloudflare?
ExxKA•6mo ago
No no, it's good that it's simple to understand.

All those details can go in the docs/FAQ section.

slightwinder•6mo ago
Where are those docs?
ExxKA•6mo ago
I really like the simplicity of the offering. The website looks great (to a human) and explains the API idea very simply. Good stuff!
verelo•6mo ago
Mobile UX is completely broken. This would be a 5-minute fix with Claude and Cursor. Signals to me that I can expect the backend to struggle with anything basic like a CAPTCHA, etc.
maticzav•6mo ago
I love the idea!

I know that https://expand.ai/ is doing something similar, maybe worth checking out.

Joeboy•6mo ago
This is relevant to my interests[0]

Based on the website I was quite skeptical. It looks too much like an "indiehacker", minimum-almost-viable-product, fake-it-till-you-make-it, trolling-for-email-addresses kind of website.

But after a quick search on twitter, it seems like people are actually using it and reporting good results. Maybe I'll take a proper look at it at some point.

I'd still like to know more about pricing, how it deals with cloudflare challenges, non-semantic markup and other awkwardnesses.

[0] https://github.com/Joeboy/cinescrapers

artluko•6mo ago
I saw your video on YouTube, really impressive.
Aaargh20318•6mo ago
It’s a cute idea, but ultimately not very useful. An API is more than just an endpoint that gives easy-to-parse results. The most important part is that an API is a contract. An API implies that things won’t suddenly break without prior announcement. Any form of web scraping, no matter how cleverly done, is inherently fragile. They can change their front-end for any reason, which could break your scraper. As such, you cannot rely on such an interface.
autonomousErwin•6mo ago
I wonder if just checking the site every day (or every minute) would solve for this.

It's not necessarily the structure of the source data (the DOM, the HTML, etc.) that needs to be contractually consistent, but rather the translator. The translator in this case is the service behind the endpoints.

Aaargh20318•6mo ago
> I wonder if just checking the site every day (or every minute) would solve for this.

No, because a webpage makes no promise not to change. Even if you check every minute, can your system handle random one-minute periods of unpredictable behavior? What if they remove data? What if the meaning of the data changes (e.g. instead of a maximum value for some field they now show the average value), how would your system deal with that? What if they are running an A/B test and 10% of your ‘API’ requests return a different page?

This is not a technical problem and the solution is not a technical one. You need to have some kind of relationship with the entity whose data you are consuming or be okay with the fact that everything can just stop working at any random moment in time.

10000truths•6mo ago
That's just part and parcel of relying on third parties - you should always price in the maintenance burden of keeping up with potential changes on their end. That burden is a lot lower if the third party cooperates with you and provides an explicit contract and backwards compatibility, but it's still not zero.
Aaargh20318•6mo ago
It’s not about the maintenance cost, it’s about continuity of service. If you scrape a website things may break at any time. If you use a proper API and have a contract with the supplier you will have the opportunity to make any changes before things break.
hoppp•6mo ago
You download the HTML, hash it with SHA-512, then run the AI and the web scraping, and cache the API content.

When the cache is invalidated, you refetch the HTML and check the SHA-512 hash to see if anything changed, then proceed based on yes or no.

Or something like that. It's not fast, but hashing and comparing is fast compared to inference anyway.
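As a rough sketch of that flow (fetch_html and extract are stand-ins for whatever fetcher and AI/scraper you actually use):

    # Only re-run the expensive AI/scraping step when the page hash changes.
    import hashlib

    cache = {}  # url -> {"hash": ..., "data": ...}

    def get_api_content(url, fetch_html, extract):
        """fetch_html(url) -> str; extract(html) -> dict. Both supplied by the caller."""
        html = fetch_html(url)
        digest = hashlib.sha512(html.encode("utf-8")).hexdigest()
        entry = cache.get(url)
        if entry and entry["hash"] == digest:
            return entry["data"]              # page unchanged: serve the cached result
        data = extract(html)                  # new or changed page: re-run extraction
        cache[url] = {"hash": digest, "data": data}
        return data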

Aaargh20318•5mo ago
I’m not sure what that would solve? Your API call is still broken. Best case you’re serving stale data.
Jotalea•6mo ago
It says the backend is down, so I guess I'll have to wait. Hope I don't forget about it before then.
p3rls•6mo ago
It's great being an independent site in 2025.

You get fucked on one side by Google promoting AI Overviews and Hindustan Times articles for everything in your niche, and on the other by these scrapers knocking your server offline.