The issue is that for any serious use of this concept, some manual adjustment is almost always needed. This service says, "Refine your scraper at any time by chatting with the AI agent," but from what I can tell, you can't actually see the code it generates.
Relying solely on the results and asking the AI to tweak them can work, but often the output is too tailored to a specific page and fails to generalize (essentially "overfitting"). And surprisingly, this back-and-forth can be more tedious and time-consuming than just editing a few lines of code yourself. Also, if you can't directly edit the code behind the scenes, there are situations where you'll never get the exact result you want, no matter how much you try to explain it to the AI in natural language.
There are efforts going back at least fifteen years to extract ontologies from natural language [0] and HTML structure [1].
[0]: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d... (2010) [PDF]
[1]: https://doi.org/10.1016/j.dss.2009.02.011 (2009)
All those details can go in the docs / faqs section.
I know that https://expand.ai/ is doing something similar; maybe worth checking out.
Based on the website I was quite skeptical. It looks too much like an "indiehacker", minimum-almost-viable-product, fake-it-till-you-make-it, trolling-for-email-addresses kind of website.
But after a quick search on twitter, it seems like people are actually using it and reporting good results. Maybe I'll take a proper look at it at some point.
I'd still like to know more about pricing, how it deals with cloudflare challenges, non-semantic markup and other awkwardnesses.
It's not necessarily the structure of the source data (the DOM, the HTML, etc.) that needs to be contractually consistent, but rather the translator. In this case, the translator is the service behind the endpoints.
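A rough sketch of what a stable contract could look like (the field names and the translate function here are made up, just to illustrate the idea):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ProductListing:
        """Hypothetical output contract: consumers depend on this shape,
        never on the page's DOM."""
        title: str
        price_cents: Optional[int]  # None when the page shows no price
        source_url: str

    def translate(raw_html: str, url: str) -> ProductListing:
        """The translator service can swap selectors, regexes, or an LLM
        behind this function, as long as the returned shape stays stable."""
        raise NotImplementedError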
No, because a webpage makes no promise not to change. Even if you check every minute, can your system handle random 1-minute periods of unpredictable behavior? What if they remove data? What if the meaning of the data changes (e.g. a field that used to show a maximum value now shows the average)? How would your system deal with that? What if they are running an A/B test and 10% of your ‘API’ requests return a different page?
This is not a technical problem and the solution is not a technical one. You need to have some kind of relationship with the entity whose data you are consuming or be okay with the fact that everything can just stop working at any random moment in time.
When the cache is invalidated, you refetch the HTML and check its SHA-512 hash to see whether anything changed, then proceed based on yes or no.
Or something like that. It's not fast, but hashing and comparing is cheap compared to inference anyway.
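Something like this, roughly (the cache layout and the extraction function are placeholders, not any particular tool):

    import hashlib
    import requests

    # url -> (sha512 of last-seen HTML, previously extracted data)
    _cache: dict[str, tuple[str, dict]] = {}

    def get_data(url: str) -> dict:
        html = requests.get(url, timeout=30).text
        digest = hashlib.sha512(html.encode("utf-8")).hexdigest()

        cached = _cache.get(url)
        if cached and cached[0] == digest:
            return cached[1]  # page unchanged: reuse prior result, skip inference

        data = run_llm_extraction(html)  # placeholder for the expensive step
        _cache[url] = (digest, data)
        return data

    def run_llm_extraction(html: str) -> dict:
        # stand-in for whatever model or parser does the real extraction
        return {}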
You get fucked on one side by Google promoting AI Overviews and Hindustan Times articles for everything in your niche, and on the other by these scrapers knocking your server offline.
My project lets you define rules for various sites, so eventually everything is scraped correctly. For YouTube, yt-dlp is also used to augment results.
I can crawl using requests, Selenium, httpx, and others. Responses come back as JSON, so they're easy to process; a rough sketch of a call is at the end of this comment.
The downside is that it may not be the fastest solution, and I have not tested it against proxies.
https://github.com/rumca-js/crawler-buddy
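A sketch of the kind of call you'd make (the endpoint path, port, and response fields here are placeholders, not the project's actual API; the README documents the real interface):

    import requests

    # Hypothetical usage: endpoint and field names are illustrative only.
    resp = requests.get(
        "http://localhost:8000/crawl",
        params={"url": "https://example.com"},
        timeout=60,
    )
    resp.raise_for_status()
    page = resp.json()
    print(page.get("title"), page.get("status_code"))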