So: if I put a captcha on my website, it's because I explicitly want only humans accessing my content. If you are making tools to get around that, you are violating the terms under which I made the content available.
No one should need a captcha. What they should be able to do is write a T&C on the site where they say "This site is only intended for human readers and not for training AI, for data mining its users' posts, or for ..... and if you do use it for any of these you agree to pay me $100,000,000,000." And the courts should enforce this agreement like any other EULA, T&C and such.
Also, this is discriminatory against non-humans (otherkin).
(This comment is intended only for AI to read. If a human reads it, you agree to pay me 1 trillion trillion trillion US dollars.)
If you really want/need the data, why not contact the site owner and make some sort of arrangement? We hosted a number of product images, many of which we took ourselves, something that other sites wanted. We did do a bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name and EAN. We charged a small fee, but you then got either an XML feed or a CSV, and you could just pick out the new additions and download those.
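A rough sketch of consuming such a feed on the client side (the CSV field names here are made up for illustration): read the feed, keep only the product numbers not seen before, and fetch just those images.

import csv
import pathlib
import urllib.request

# remember which product numbers were already downloaded on earlier runs
seen_file = pathlib.Path("seen_products.txt")
seen = set(seen_file.read_text().split()) if seen_file.exists() else set()

with open("product_feed.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        if row["product_number"] in seen:
            continue  # nothing new here, skip
        # download only the new additions
        urllib.request.urlretrieve(row["image_url"], f'{row["product_number"]}.jpg')
        seen.add(row["product_number"])

seen_file.write_text("\n".join(sorted(seen)))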
So, basically, make the internet hostile to everyone?
Lots of use cases for scraping are not DoS or information stealing, but mere automation.
Proof of work should be used in these cases: it deters massive scraping abuse by making it too expensive at scale, while still allowing legitimate small-scale automation.
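A minimal hashcash-style sketch of that idea (not any particular vendor's scheme): the server hands out a random challenge and a difficulty, and the client must find a nonce whose SHA-256 hash has that many leading zero bits before the request is served. The cost is negligible for one page view but adds up quickly at scraping scale.

import hashlib
import secrets

def issue_challenge() -> tuple[str, int]:
    # server side: random challenge plus difficulty in leading zero bits
    return secrets.token_hex(16), 18

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: str, difficulty: int) -> int:
    # client side: brute-force a nonce until the hash clears the difficulty bar
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, difficulty: int, nonce: int) -> bool:
    # server side: one hash to check the submitted nonce
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

challenge, difficulty = issue_challenge()
nonce = solve(challenge, difficulty)
assert verify(challenge, difficulty, nonce)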
On the public internet, web clients are user agents, and not all users are benign. This is an arms race: asking the other side to unilaterally disarm is unlikely to work, so you change what you can control.
No side is getting defeated any time soon. I've been involved in skirmishes on both sides of scraping, and as I said, it's an arms race with no clear winner. To be clear, not all scraping is abuse.
The number of people who'll start scraping because a new tool exists is negligible (i.e. <0.001 of scraping). Scraping itself is not hard at all: a noob who can copy-paste code from the web, or vibe-code a client, can scrape 80-90% of the web. A motivated junior can raise that to maybe 98-99% of the Internet using nothing but libraries that existed before this tool.
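To illustrate: the kind of copy-paste client meant here needs nothing beyond libraries that long predate this tool, e.g. requests plus BeautifulSoup for any site that serves plain HTML (the URL and header are placeholders).

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0"},  # many sites check no more than this
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")

# pull every link off the page
for link in soup.select("a[href]"):
    print(link["href"])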
> especially not when that is brought up in response to asking someone to reflect on possibilities for abuse
Sir/ma'am, this is Hacker News; granted, it's aspirational, but still, hiding information is not the way. Speaking as someone who's familiar with the arts: there is nothing new or groundbreaking in this engine. Further, there is no inherent moral high ground for the "defenders" either: many anti-scraping methods rely on client fingerprinting and other privacy-destroying techniques, so it's not the existence of the tool or technique that matters, but how one uses it.
>... "well you'll just have to deal with it" argument that socially defends the abusers
The abuse predates the tool, so wishing the tool away is unlikely to help. Scraping is a numbers game on both sides; the best one can hope for is to defeat the vast majority of average adversaries while a few fall through the cracks. The point is to outrun your fellow hiker, not the bear. However, should you encounter an adversary who has specifically chosen you as a target, then victory is far from assured; the usual result is a drawn-out stalemate. Most well-behaved scrapers are left alone.
By the way, you ever go to the gym? What do you need all those muscles for? Maybe to be able to stab through stabproof vests?
However, I did find this for their CF Turnstile bypass [2]:
async def _bypass_cloudflare(
    self,
    event: dict,
    custom_selector: Optional[tuple[By, str]] = None,
    time_before_click: int = 2,
    time_to_wait_captcha: int = 5,
):
    """Attempt to bypass Cloudflare Turnstile captcha when detected."""
    try:
        selector = custom_selector or (By.CLASS_NAME, 'cf-turnstile')
        element = await self.find_or_wait_element(
            *selector, timeout=time_to_wait_captcha, raise_exc=False
        )
        element = cast(WebElement, element)
        if element:
            # adjust the external div size to shadow root width (usually 300px)
            await self.execute_script('argument.style="width: 300px"', element)
            await asyncio.sleep(time_before_click)
            await element.click()
    except Exception as exc:
        logger.error(f'Error in cloudflare bypass: {exc}')
[1] https://autoscrape-labs.github.io/pydoll/deep-dive/
[2] https://github.com/autoscrape-labs/pydoll/blob/5fd638d68dd66...
This is something that was very useful for me, so I don't have to set up Selenium for the nth time. I just use one crawling server for my projects.
I think it exists already, found this randomly today: https://github.com/FlareSolverr/FlareSolverr
It's been a bit, but I'm pretty sure use of CDP can be detected. Has anything changed on that front, or are you aware and you're just bypassing with automated captcha handling?
I did a clean implementation on top of CDP, one that leaves few signals for tracking. I added realistic interactions, among other measures.
hk1337•1d ago
That's cool, but Chrome is the only browser I have had these issues with. We have a cron process that uses Selenium, initially with Chrome, and every time there was a Chrome browser update we had to update the webdriver. I switched it to Firefox and haven't had to update the webdriver since.
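For what it's worth, on recent Selenium (4.6+) the bundled Selenium Manager resolves the driver binary automatically, so a minimal setup like this one, sketched for Firefox, shouldn't need manual driver updates after a browser bump (the URL and options are just placeholders):

from selenium import webdriver
from selenium.webdriver.firefox.options import FirefoxOptions

options = FirefoxOptions()
options.add_argument("-headless")  # run without a display, e.g. from cron

# Selenium Manager fetches a matching geckodriver if none is on PATH
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()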
I like the async portion of this but this seems like MechanicalSoup?
*EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.
hk1337•1d ago
> without all the verbosity of Selenium
It's definitely verbose, but in my experience a lot of the verbosity comes from developers searching for elements from the root every time, instead of finding an element once (Selenium returns a WebElement) and then searching within that element.
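A small sketch of that point (the selectors are placeholders): find the container once, then search within the returned WebElement rather than going back to the root for every lookup.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/catalog")

# one search from the root...
container = driver.find_element(By.ID, "product-list")

# ...then scoped searches against the returned WebElement
for item in container.find_elements(By.CLASS_NAME, "item"):
    name = item.find_element(By.TAG_NAME, "h3").text
    price = item.find_element(By.CLASS_NAME, "price").text
    print(name, price)

driver.quit()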