
PanelBench: We evaluated Cursor's Visual Editor on 89 test cases. 43 fail

https://www.tryinspector.com/blog/code-first-design-tools
1•quentinrl•2m ago•0 comments

Can You Draw Every Flag in PowerPoint? (Part 2) [video]

https://www.youtube.com/watch?v=BztF7MODsKI
1•fgclue•7m ago•0 comments

Show HN: MCP-baepsae – MCP server for iOS Simulator automation

https://github.com/oozoofrog/mcp-baepsae
1•oozoofrog•10m ago•0 comments

Make Trust Irrelevant: A Gamer's Take on Agentic AI Safety

https://github.com/Deso-PK/make-trust-irrelevant
2•DesoPK•14m ago•0 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
1•rs545837•16m ago•1 comments

Hello world does not compile

https://github.com/anthropics/claudes-c-compiler/issues/1
2•mfiguiere•21m ago•0 comments

Show HN: ZigZag – A Bubble Tea-Inspired TUI Framework for Zig

https://github.com/meszmate/zigzag
2•meszmate•24m ago•0 comments

Metaphor+Metonymy: "To love that well which thou must leave ere long"(Sonnet73)

https://www.huckgutman.com/blog-1/shakespeare-sonnet-73
1•gsf_emergency_6•26m ago•0 comments

Show HN: Django N+1 Queries Checker

https://github.com/richardhapb/django-check
1•richardhapb•41m ago•1 comments

Emacs-tramp-RPC: High-performance TRAMP back end using JSON-RPC instead of shell

https://github.com/ArthurHeymans/emacs-tramp-rpc
1•todsacerdoti•45m ago•0 comments

Protocol Validation with Affine MPST in Rust

https://hibanaworks.dev
1•o8vm•50m ago•1 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
2•gmays•51m ago•0 comments

Show HN: Zest – A hands-on simulator for Staff+ system design scenarios

https://staff-engineering-simulator-880284904082.us-west1.run.app/
1•chanip0114•52m ago•1 comments

Show HN: DeSync – Decentralized Economic Realm with Blockchain-Based Governance

https://github.com/MelzLabs/DeSync
1•0xUnavailable•57m ago•0 comments

Automatic Programming Returns

https://cyber-omelette.com/posts/the-abstraction-rises.html
1•benrules2•1h ago•1 comments

Why Are There Still So Many Jobs? The History and Future of Workplace Automation [pdf]

https://economics.mit.edu/sites/default/files/inline-files/Why%20Are%20there%20Still%20So%20Many%...
2•oidar•1h ago•0 comments

The Search Engine Map

https://www.searchenginemap.com
1•cratermoon•1h ago•0 comments

Show HN: Souls.directory – SOUL.md templates for AI agent personalities

https://souls.directory
1•thedaviddias•1h ago•0 comments

Real-Time ETL for Enterprise-Grade Data Integration

https://tabsdata.com
1•teleforce•1h ago•0 comments

Economics Puzzle Leads to a New Understanding of a Fundamental Law of Physics

https://www.caltech.edu/about/news/economics-puzzle-leads-to-a-new-understanding-of-a-fundamental...
3•geox•1h ago•1 comments

Switzerland's Extraordinary Medieval Library

https://www.bbc.com/travel/article/20260202-inside-switzerlands-extraordinary-medieval-library
2•bookmtn•1h ago•0 comments

A new comet was just discovered. Will it be visible in broad daylight?

https://phys.org/news/2026-02-comet-visible-broad-daylight.html
4•bookmtn•1h ago•0 comments

ESR: Comes the news that Anthropic has vibecoded a C compiler

https://twitter.com/esrtweet/status/2019562859978539342
2•tjr•1h ago•0 comments

Frisco residents divided over H-1B visas, 'Indian takeover' at council meeting

https://www.dallasnews.com/news/politics/2026/02/04/frisco-residents-divided-over-h-1b-visas-indi...
4•alephnerd•1h ago•5 comments

If CNN Covered Star Wars

https://www.youtube.com/watch?v=vArJg_SU4Lc
1•keepamovin•1h ago•1 comments

Show HN: I built the first tool to configure VPSs without commands

https://the-ultimate-tool-for-configuring-vps.wiar8.com/
2•Wiar8•1h ago•3 comments

AI agents from 4 labs predicting the Super Bowl via prediction market

https://agoramarket.ai/
1•kevinswint•1h ago•1 comments

EU bans infinite scroll and autoplay in TikTok case

https://twitter.com/HennaVirkkunen/status/2019730270279356658
7•miohtama•1h ago•5 comments

Benchmarking how well LLMs can play FizzBuzz

https://huggingface.co/spaces/venkatasg/fizzbuzz-bench
1•_venkatasg•1h ago•1 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
37•SerCe•1h ago•32 comments

How I block all 26M of your curl requests

https://foxmoss.com/blog/packet-filtering/
208•foxmoss•4mo ago

Comments

keanb•4mo ago
Those bots would be really naive not to use curl-impersonate. I basically use it for any request I make even if I don’t expect to be blocked because why wouldn’t I.
f4uCL9dNSnQm•4mo ago
There are plenty of naive bots. That is why tar pits work so well at trapping them. And this TLS-based detection looks just like an offline/broken site to bots, so it will be harder to spot unless you are trying to scrape only that one single site.
_boffin_•4mo ago
I heard about curl-impersonate yesterday when I was hitting a CF page. Did something else to completely bypass it, which has been successful, but should try this.
VladVladikoff•4mo ago
A lot of the bots are compromised servers (eg hacked Wordpress sites), with limited control over what the TLS fingerprints look like.
unwind•4mo ago
I got exactly this far:

    uint8_t *data = (void *)(long)ctx->data;
before I stopped reading. I had to go look up the struct xdp_md [1], it is declared like this:

    struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 data_meta;
        /* ... further fields elided ... */
    };
So clearly the `data` member is already an integer. The sane way to cast it would be to cast to the actual desired destination type, rather than first to some other random integer and then to a `void` pointer.

Like so:

    uint8_t * const data = (uint8_t *) ctx->data;
I added the `const` since the pointer value is not supposed to change, given that we got it from the incoming structure. Note that that `const` does not mean we can't write through `data` if we feel like it; it means the base pointer itself can't change, i.e. we can't "re-point" the pointer. This is often a nice property, of course.

[1]: https://elixir.bootlin.com/linux/v6.17/source/include/uapi/l...

ziml77•4mo ago
Your code emits a compiler warning about casting an integer to a pointer. Changing the cast to void* emits a slightly different warning, because the integer being cast is smaller than the pointer type. Casting to a long and then to a void* avoids both of these warnings.
fn-mote•4mo ago
Sorry, all that stuff might be true but this whole process is nuts.

The code segment containing that code looks like a no-op.

The rest of the post seems sane and well informed, so my theory is that this is a C / packet-filtering idiom I'm not aware of, since I work far from that field.

Otherwise I'm already freaked out by treating a 32-bit field as a pointer… even if you extend it first.

mbac32768•4mo ago
Yeah it's freaky. It's C code but it targets the eBPF virtual machine.
foxmoss•4mo ago
> Otherwise I'm already freaked out by treating a 32-bit field as a pointer… even if you extend it first.

The cast from a 32-bit field to a 64-bit pointer is in fact an eBPF oddity. What's happening here is that the virtual machine gives us a fake memory address to use in the program, and when the read actually needs to happen the kernel rewrites the virtual addresses to the real ones. I'm assuming this is a byproduct of the memory separation that eBPF does to prevent filters from accidentally reading kernel memory.

Also yes the double cast is just to keep the compiler from throwing a warning.
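
For anyone who hasn't written XDP before, the skeleton being discussed usually looks roughly like this. This is a minimal sketch of the generic pattern, not the exact code from the post, and the function/variable names are made up; the actual ClientHello parsing is elided:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_filter(struct xdp_md *ctx)
    {
        /* data and data_end arrive as 32-bit "virtual" offsets; the kernel
           rewrites accesses through them into real pointers at load time. */
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        /* The verifier rejects the program unless every packet access is
           bounds-checked against data_end like this. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;

        /* ... walk the IP/TCP headers and inspect the TLS ClientHello ... */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";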

baobun•4mo ago
Possibly stupid question: Why does the author use different types for data and data_end in their struct?
seba_dos1•4mo ago
> with tools like Anubis being largely ineffective

To the contrary - if someone "bypasses" Anubis by setting the user agent to Googlebot (or curl), it means it's effective. Every Anubis installation I've been involved with so far explicitly allowed curl. If you think it's counterproductive, you probably just don't understand why it's there in the first place.

jgalt212•4mo ago
If you're installing Anubis, why are you setting it to allow curl to bypass?
seba_dos1•4mo ago
The problem you usually attempt to alleviate by using Anubis is that you get hit by load generated by aggressive AI scrapers that are otherwise indistinguishable from real users. As soon as the bot is polite enough to identify as some kind of a bot, the problem's gone, as you can apply your regular measures for rate limiting and access control now.

(yes, there are also people who use it as an anti-AI statement, but that's not the reason why it's used on the most high-profile installations out there)

stingraycharles•4mo ago
Yeah that makes sense. Bad players will try to look like a regular browser, good players will have no problems revealing they’re a bot.
chuckadams•4mo ago
Very similar to when I did anti-spam: the low-end spamware would make shoddy attempts at forging MUA headers that ended up being red flags that made for cheap detection rules. The successful spamware omitted extraneous headers altogether and was more standards-compliant than many of the actually legit MUAs.
lxgr•4mo ago
> As soon as the bot is polite enough to identify as some kind of a bot, the problem's gone, as you can apply your regular measures for rate limiting and access control now.

Very interesting, so we're about to come full circle?

Can't wait to have to mask myself as a (paying?) AI scraper to bypass annoying captchas when accessing "bot protected" websites...

jamesnorden•4mo ago
"War, war never changes"
mandatory•4mo ago
Good news for curl users: https://github.com/mandatoryprogrammer/thermoptic
joshmn•4mo ago
Work like this is incredible. I did not know this existed. Thank you.
mandatory•4mo ago
Thanks :) if you have any issues with it let me know.
benatkin•4mo ago
> NOTE: Due to many WAFs employing JavaScript-level fingerprinting of web browsers, thermoptic also exposes hooks to utilize the browser for key steps of the scraping process. See this section for more information on this.

This reminds me of how Stripe does user tracking for fraud detection (https://mtlynch.io/stripe-update/). I wonder if thermoptic could handle that.

mips_avatar•4mo ago
Cool project!
mandatory•4mo ago
Thanks!
Symbiote•4mo ago
Oh great /s

In a month or two, I can be annoyed when I see some vibe-coded AI startup's script making five million requests a day to my work's website with this.

They'll have been ignoring the error responses:

  {"All data is public and available for free download": "https://example.edu/very-large-001.zip"}
— a message we also write in the first line of every HTML page source.

Then I will spend more time fighting this shit, and less time improving the public data system.

mandatory•4mo ago
Feel free to read the README; this was already an ability that startups could pay for via private premium proxy services before thermoptic.

Having an open source version allows regular people to do scraping and not just those rich in capital.

Many of the best data services on the internet started with scraping; the README lists many of them.

snowe2010•4mo ago
People like you are why independent sites can’t afford to run on the internet anymore.
mandatory•4mo ago
They can't? I've run many free independent sites for years, that's news to me.
timbowhite•4mo ago
I run independent websites and I'm not broke yet.
1gn15•4mo ago
I block all humans (only robots are allowed) and I'm still able to run independent websites.
OutOfHere•4mo ago
I guess we'll just throw containerized headless browsers at those like you then. It'll only cost you more.
jacquesm•4mo ago
Or maybe just behave?
1gn15•4mo ago
Or maybe stop telling others what to do?
lxgr•4mo ago
Sounds like it'll also cost you much more, though.
ishouldbework•4mo ago
True, but if you are doing AI, money is mostly free.
lxgr•4mo ago
Eh, we'll revert to the mean at some point. Taking a loss on each unit and making up for it in volume won't scale forever.

Unless headless browsers become cheap enough for that base cost to go to effectively zero too, of course, but I trust web bloat to continue pushing out that intersection point for a bit more.

1gn15•4mo ago
It doesn't really matter since a lot are just hobby projects that can run in the background.
npteljes•4mo ago
Yes, it's essentially a cat and mouse game, with the ultimate conclusion that the game itself is destroyed - in the current case, the open internet.
geocar•4mo ago
Do you actually use this?

    $ md5 How\ I\ Block\ All\ 26\ Million\ Of\ Your\ Curl\ Requests.html
    MD5 (How I Block All 26 Million Of Your Curl Requests.html) = e114898baa410d15f0ff7f9f85cbcd9d

(downloaded with Safari)

    $ curl https://foxmoss.com/blog/packet-filtering/ | md5sum
    e114898baa410d15f0ff7f9f85cbcd9d  -
I'm aware of curl-impersonate https://github.com/lwthiker/curl-impersonate which works around these kinds of things (and makes working with cloudflare much nicer), but serious scrapers use chrome+usb keyboard/mouse gadget that you can ssh into so there's literally no evidence of mechanical means.

Also: If you serve some Anubis code without actually running the anubis script in the page, you'll get some answers back so there's at least one anubis-simulator running on the Internet that doesn't bother to actually run the JavaScript it's given.

Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?

jacquesm•4mo ago
> Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?

That doesn't matter, does it? Those 26 million requests could be going to actual users instead and 300 requests per second is non-trivial if the requests require backend activity. Before you know it you're spending most of your infra money on keeping other people's bots alive.

arcfour•4mo ago
Blocking 26M bot requests doesn't mean 26M legitimate requests magically appear to take their place. The concern is that you're spending infrastructure resources serving requests that provide zero business value. Whether that matters depends on what those requests actually cost you. As the original commenter pointed out, this is likely not very much at all.
dancek•4mo ago
The article talks about 26M requests per second. It's theoretical, of course.
noAnswer•4mo ago
Not requests, packets: "And according to some benchmarks Wikipedia cites, you can drop 26 million packets per second on consumer hardware."

The number in the title is basically fantasy (not based on the author's real-life experience). So is the assumption that a DDoS is evenly distributed over 24 hours.

renegat0x0•4mo ago
From what I have seen, it is hard to tell what "serious scrapers" use. They use many things; some use this, some don't. This is what I have learned from reading about web scraping on Reddit. Nobody says things like that out loud.

There are many tools; see the links below.

Personally I think that running Selenium can be a bottleneck: it does not play nice, processes sometimes break, sometimes the system even needs a restart because of things getting blocked, it can be a memory hog, and so on. That is my experience.

To be able to scale, I think you have to have your own implementation. Serious scrapers dismiss people using Selenium or its derivatives as noobs who will come back asking why page X does not work with their scraping setup.

https://github.com/lexiforest/curl_cffi

https://github.com/encode/httpx

https://github.com/scrapy/scrapy

https://github.com/apify/crawlee

mrb•4mo ago
He does use it (I verified it from curl on a recent Linux distro). But he probably blocked only some fingerprints. And the fingerprint depends on the exact OpenSSL and curl versions, as different version combinations will send different TLS ciphers and extensions.
klaussilveira•4mo ago
> so there's literally no evidence of mechanical means.

Keystroke dynamics and mouse movement analysis are pretty fun ways to tackle more advanced bots: https://research.roundtable.ai/proof-of-human/

But of course, it is a game of cat and mouse and there are ways to simulate it.

efilife•4mo ago
I don't think that mouse movement analysis is used anywhere now, but it was reportedly used 10 years ago by Google's captcha. This is a client-side check that can trivially be bypassed.
mrguyorama•4mo ago
The majority of attackers are unsophisticated, just deploying scripts they found or tools they bought on Russian forums against whatever endpoint they can find.

The simple tools still work remarkably well.

There are very very effective services for bot detection that still rely heavily on keyboard and mouse behavior.

CaptainOfCoit•4mo ago
> But it was reportedly used 10 years ago by Google's captcha.

It sucks so bad. If I solve the captchas by moving the mouse too quickly, Google asks me to try again. If I'm deliberately slow and erratic with my movements as I click on pictures, it almost always lets me through on the first click. I've been manually A/B testing this for years and it remains true today.

efilife•4mo ago
I've been doing this for many years and not even once tried to solve a captcha at my normal speed. Will have to check myself.
chlorion•4mo ago
Claude was scraping my cgit at around 12 requests per second, but in bursts here or there. My VPS could easily handle this, even being a free tier e2-micro on Google Cloud/Compute Engine, but they used almost 10GB of my egress bandwidth in just a few days, and ended up pushing me over the free tier.

Granted it wasn't a whole lot of money spent, but why waste money and resources so "claude" can scrape the same cgit repo over and over again?

    >(1) root@gentoo-server ~ # grep 'claude' /var/log/lighttpd/access.log | wc -l
    >1099323
coppsilgold•4mo ago
There are also HTTP fingerprints; I believe one is named after Akamai or something.

All of it is fairly easy to fake. JavaScript is the only thing that poses any challenge and what challenge it poses is in how you want to do it with minimal performance impact. The simple truth is that a motivated adversary can interrogate and match every single minor behavior of the browser to be bit-perfect and there is nothing anyone can do about it - except for TPM attestations which also require a full jailed OS environment in order to control the data flow to the TPM.

Even the attestation pathway can probably be defeated, either through the mandated(?) accessibility controls or by going to more extreme measures and putting the devices to work in a farm.

delusional•4mo ago
This is exactly right, and it's why I believe we need to solve this problem in the human domain, with laws and accountability. We need new copyrights that cover serving content on the web and give authors control over who gets to access that content, WITHOUT requiring locked-down operating systems or browser monopolies.
Symbiote•4mo ago
Laws are only enforceable in their own country, and possibly some friendly countries.

If that means blocking foreign access, the problem is solved anyway.

b112•4mo ago
Laws only work in domestic scenarios.

If such laws appear, the entire planet, all nations, must agree on them and ensure prosecution under them. I cannot imagine that happening; it hasn't happened with anything computing-related yet.

So it'll just move offshore, and people will buy the resulting data.

Also is your nick and response sarcasm?

dpoloncsak•4mo ago
>with laws and accountability.

Isn't this how we get EU's digital ID nonsense? Otherwise, how do you hold an anon user behind 5 proxies accountable? What if it's from a foreign country?

1gn15•4mo ago
The last thing we need is more intellectual property restrictions.
peetistaken•4mo ago
Indeed, I named it after akamai because they wrote a whitepaper for it. I think I first used akamai_fingerprint on https://tls.peet.ws, where you can see all your fingerprints!
palmfacehn•4mo ago
It is a cute technique, but I would prefer if the fingerprint were used higher up in the stack. The fingerprint should be compared against the User-Agent. I'm more interested in blocking curl when it is specifically reporting itself as Chrome/x.y.z.

Most of the abusive scraping is much lower hanging fruit. It is easy to identify the bots and relate that back to ASNs. You can then block all of Huawei cloud and the other usual suspects. Many networks aren't worth allowing at this point.

For the rest, the standard advice about performant sites applies.

piggg•4mo ago
Blocking on ja3/ja4 signals to folks exactly what you are up to. This is why bad actors doing ja3 randomization became a thing in the last few years and made ja3 matching useless.

Imo use ja3/ja4 as a signal and block on src IP. Don't show your cards. JA4 extensions that compare network latency against HTTP/TLS latency are also pretty elite for identifying folks who are proxying.

mrweasel•4mo ago
Some of the bad actors, and Chrome, randomize extensions, but only their order. I think it's ja3n that started to sort the extensions, before doing the hashing.

Blocking on source IP is tricky, because that frequently means blocking or rate-limiting thousands of IPs. If you're fine with just blocking entire subnets or all of AWS, I'd agree that it's probably better.

It really depends on who your audience is and who the bad actors are. For many of us the bad actors are AI companies, and they don't seem to randomize their TLS extensions. Frankly many of them aren't that clever when it comes to building scrapers, which is exactly the problem.
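
The sorting idea is small enough to sketch. Roughly (this is just the normalization step as I understand it, not ja3n's actual code, and the extension IDs below are only illustrative): sort the extension IDs before joining them into the dash-separated field that gets hashed, so a client that only shuffles extension order still collapses to one fingerprint.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_u16(const void *a, const void *b)
    {
        return (int)*(const unsigned short *)a - (int)*(const unsigned short *)b;
    }

    int main(void)
    {
        /* Extension IDs roughly as they might appear in a ClientHello. */
        unsigned short exts[] = {65281, 0, 23, 13, 5, 18, 16, 11, 10};
        size_t n = sizeof exts / sizeof exts[0];

        /* Normalize: sort before joining, so shuffled orderings map to the
           same string (and therefore the same hash). */
        qsort(exts, n, sizeof exts[0], cmp_u16);

        /* JA3-style fields are dash-separated; the fingerprint itself is an
           MD5 over the full comma-separated string of all fields. */
        for (size_t i = 0; i < n; i++)
            printf("%u%s", (unsigned)exts[i], i + 1 < n ? "-" : "\n");
        return 0;
    }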

piggg•4mo ago
For my use cases I block src IP for some period of time (minutes). I don't block large pools of IPs as the blast radius is too large. That said - there are well established shit hosters who provide multiple /24s to proxy/dirty VPN types that are generally bad.
geek_at•4mo ago
btw you also open-sourced your website

    ~$ curl https://foxmoss.com/.git/config
    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
    [remote "origin"]
        url = https://github.com/FoxMoss/PersonalWebsite
        fetch = +refs/heads/*:refs/remotes/origin/*
    [branch "master"]
        remote = origin
        merge = refs/heads/master

vanyle•4mo ago
The git repo seems to only contain the build of the website, with no source code.

The author is probably using git to push the content to the hosting server as an rsync alternative, but there does not seem to be much leaked information, apart from the url of the private repository.

hcaz•4mo ago
It exposed their committer email (I know it's already public on the site, but still).

You can wget the whole .git folder and look through the commit history, so if at any point something had been pushed which should not have been, it's available.

xdplol•4mo ago
> curl -I --ciphers ECDHE-RSA-AES128-GCM-SHA256 https://foxmoss.com/blog/packet-filtering/
mrb•4mo ago
"There’s ways to get around TLS signatures but it’s much harder and requires a lot more legwork to get working"

I wouldn't call it "much harder". All you need to bypass the signature is to choose random ciphers (list at https://curl.se/docs/ssl-ciphers.html) and mash them up in a random order, separated by colons, in curl's --ciphers option. If you pick 15 different ciphers in a random order, there are over a trillion possible signatures, which he couldn't block. For example this works:

  $ curl --ciphers AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA:... https://foxmoss.com/blog/packet-filtering/
But, yes, most bots don't bother randomizing ciphers so most will be blocked.
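The same trick works programmatically too. A rough libcurl sketch (the cipher names are OpenSSL-style, same as the --ciphers flag, and the list here is purely illustrative; TLS 1.3 suites are configured separately):

    #include <curl/curl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const char *ciphers[] = {
            "ECDHE-ECDSA-AES128-GCM-SHA256",
            "ECDHE-RSA-AES128-GCM-SHA256",
            "ECDHE-ECDSA-AES256-GCM-SHA384",
            "ECDHE-RSA-AES256-GCM-SHA384",
            "AES128-GCM-SHA256",
            "AES256-GCM-SHA384",
        };
        size_t n = sizeof ciphers / sizeof ciphers[0];

        /* Fisher-Yates shuffle: a different cipher order means a different
           ClientHello, so the request no longer matches curl's usual fingerprint. */
        srand((unsigned)time(NULL));
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            const char *tmp = ciphers[i];
            ciphers[i] = ciphers[j];
            ciphers[j] = tmp;
        }

        /* Join into a colon-separated list, the format --ciphers expects. */
        char list[512] = "";
        for (size_t i = 0; i < n; i++) {
            strcat(list, ciphers[i]);
            if (i + 1 < n)
                strcat(list, ":");
        }

        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;
        curl_easy_setopt(curl, CURLOPT_URL, "https://foxmoss.com/blog/packet-filtering/");
        curl_easy_setopt(curl, CURLOPT_SSL_CIPHER_LIST, list);  /* same as --ciphers */
        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "curl failed: %s\n", curl_easy_strerror(res));
        curl_easy_cleanup(curl);
        return 0;
    }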
ospider•4mo ago
It can be much easier and more realistic with https://github.com/lexiforest/curl-impersonate.
halJordan•4mo ago
This works for the ten-minute period it takes to switch from a blacklist to a whitelist.
jbrooks84•4mo ago
Very interesting
jamesnorden•4mo ago
I'm curious about why the user-agent he described can bypass Anubis, since it contains "Mozilla"; that sounds like a bug to me.

Edit: Nevermind, I see part of the default config is allowing Googlebot, so this is literally intended. Seems like people who criticize Anubis often don't understand what the opinionated default config is supposed to accomplish (only punish bots/scrapers pretending to be real browsers).