Bad bots don't even read robots.txt.
I had other ideas too, but I don't know how well some of them will work (they might depend on what bots they are).
An alternative might be to use Brotli which has a static dictionary. Maybe that can be used to achieve a high compression ratio.
For example, with gzip using default options:
me@here:~$ pv /dev/zero -s 10M -S | gzip -c | wc -c
10.0MiB 0:00:00 [ 122MiB/s] [=============================>] 100%
10208
me@here:~$ pv /dev/zero -s 100M -S | gzip -c | wc -c
100MiB 0:00:00 [ 134MiB/s] [=============================>] 100%
101791
me@here:~$ pv /dev/zero -s 1G -S | gzip -c | wc -c
1.00GiB 0:00:07 [ 135MiB/s] [=============================>] 100%
1042069
me@here:~$ pv /dev/zero -s 10M -S | tr "\000" "\141" | gzip -c | wc -c
10.0MiB 0:00:00 [ 109MiB/s] [=============================>] 100%
10209
me@here:~$ pv /dev/zero -s 100M -S | tr "\000" "\141" | gzip -c | wc -c
100MiB 0:00:00 [ 118MiB/s] [=============================>] 100%
101792
me@here:~$ pv /dev/zero -s 1G -S | tr "\000" "\141" | gzip -c | wc -c
1.00GiB 0:00:07 [ 129MiB/s] [=============================>] 100%
1042071
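The same comparison can be reproduced without pv, as a rough sketch in Python. The helper name `gzip_size` is made up for illustration, and the byte counts will not necessarily match the shell output above exactly, since header details can differ between tools.

```python
import gzip

MIB = 1024 * 1024

def gzip_size(fill: bytes, total: int) -> int:
    """Return the gzip-compressed size of `total` bytes of the one-byte pattern `fill`."""
    # compresslevel=6 mirrors the gzip command-line default.
    return len(gzip.compress(fill * total, compresslevel=6))

zeros = gzip_size(b"\x00", 10 * MIB)
letters = gzip_size(b"a", 10 * MIB)
print("zeros:", zeros, "letters:", letters, "ratio about", (10 * MIB) // zeros, "to 1")
```

Whichever byte is repeated, the ratio stays near the roughly 1000:1 ceiling that the numbers above suggest for a single gzip layer.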
Two bytes difference for a 1GiB sequence of “aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa…” (\141) compared to a sequence of \000.
2048 yottabyte Zip Bomb
This zip bomb uses overlapping files and recursion to achieve 7 layers with 256 files each, with the last being a 32GB file.
It is only 266 KB on disk.
When you realise it's a zip bomb it's already too late. Looking at the file size doesn't betray its contents. Maybe applying some heuristics with ClamAV? But even then it's not guaranteed. I think a small partition to isolate decompression is actually really smart. Wonder if we can achieve the same with overlays.
I'm sure though that if it was as simple as that we wouldn't even have a name for it.
It's just nobody usually implements a limit during decompression because people aren't usually giving you zip bombs. And sometimes you really do want to decompress ginormous files, so limits aren't built in by default.
Your given language might not make it easy to do, but you should pretty much always be able to hack something together using file streams. It's just an extra step is all.
It's intuitively extremely strange to me!
Even ignoring how zips work: Memory needs to be allocated in chunks. So before allocating a chunk, you can check if the new memory use will be over a threshold. CPU is used by the program instructions you control, so you can put checks at significant points in your program to see if it hit a threshold. Or you can have a thread you kill after a certain amount of time.
But the way zips do work makes it a lot simpler: Fundamentally it's "output X raw bytes, then repeat Y bytes from location Z" over and over. Abort if those numbers get too big.
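A minimal sketch of that "abort if the numbers get too big" idea for a gzip stream, using Python's zlib module. The function name `bounded_gunzip`, the exception class, and the default chunk size are assumptions made for illustration, not anything from the thread.

```python
import zlib

class OutputLimitExceeded(Exception):
    """Raised when decompressed output grows past the caller's limit."""

def bounded_gunzip(data: bytes, limit: int, chunk: int = 64 * 1024) -> bytes:
    """Decompress a gzip stream, refusing to produce more than `limit` bytes."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    out = bytearray()
    buf = data
    while not d.eof:
        # max_length caps how much output a single call may produce.
        piece = d.decompress(buf, chunk)
        if not piece and not d.unconsumed_tail:
            raise ValueError("truncated gzip stream")
        out += piece
        if len(out) > limit:
            raise OutputLimitExceeded(f"more than {limit} bytes of output")
        # Input that was not consumed because of max_length is fed back in.
        buf = d.unconsumed_tail
    return bytes(out)
```

Because output is produced at most `chunk` bytes at a time, memory use stays near `limit` plus one chunk even if the stream would expand to gigabytes.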
cgroups with hard-limits will let the external tool's process crash without taking down the script or system along with it.
This is exactly the same idea as partitioning, though.
In a practical sense, how's that different from creating a N-byte partition and letting the OS return ENOSPC to you?
import zipfile

with zipfile.ZipFile("zipbomb.zip") as zf:
    for name in zf.namelist():
        print("working on " + name)
        # Per-file cap on decompressed bytes.
        left = 1000000
        with open("dest_" + name, "wb") as fdest, zf.open(name) as fsrc:
            while True:
                # Read in small blocks so memory use stays bounded.
                block = fsrc.read(1000)
                if len(block) == 0:
                    break
                fdest.write(block)
                left -= len(block)
                if left <= 0:
                    # Stop extracting this member once the cap is hit.
                    print("too much data!")
                    break
Also zip bombs are not comically large until you unzip them.
Also you can just unpack any sort of compressed file format without giving any thought to whether you are handling it safely.
Edit: And for folks who write their own web pages, you can always create zip bombs that are links on a web page that don't show up for humans (white text on white background with no highlight on hover/click anchors). Bots download those things to have a look (so do crawlers and AI scrapers)
It's also not a common metric you can filter on in open firewalls since you must lookup and maintain a cache of IP to ASN, which has to be evicted and updated as blocks still move around.
Automated banning is harder, you'd probably want a heuristic system and look up info on IPs.
IPv4 with NAT means you can "overban" too.
I did a version of this with my form for requesting an account on my fediverse server. The problem I was having is that there exist these very unsophisticated bots that crawl the web and submit their very unsophisticated spam into every form they see that looks like it might publish it somewhere.
First I added a simple captcha with distorted characters. This did stop many of the bots, but not all of them. Then, after reading the server log, I noticed that they only make three requests in rapid succession: the page that contains the form, the captcha image, and then the POST request with the form data. They load neither the CSS nor the JS.
So I added several more fields to the form and hid them with CSS. Submitting anything in these fields will fail the request and ban your session. I also modified the captcha: I made the image itself a CSS background, and made the src point to a transparent image instead.
And just like that, spam has completely stopped, while real users noticed nothing.
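The server-side half of that hidden-field check could look roughly like this. The field names in `HONEYPOT_FIELDS` and the function name are hypothetical; the actual form described above is not shown in the thread.

```python
from urllib.parse import parse_qs

# Hypothetical names of the CSS-hidden fields; real users never fill these in.
HONEYPOT_FIELDS = ("email", "website")

def is_bot_submission(body: str) -> bool:
    """Return True if any hidden honeypot field carries a non-blank value."""
    # keep_blank_values=True so empty fields are still present in the result.
    form = parse_qs(body, keep_blank_values=True)
    return any(
        value.strip()
        for name in HONEYPOT_FIELDS
        for value in form.get(name, [])
    )
```

A request handler would call `is_bot_submission` on the URL-encoded POST body and reject the request (and optionally ban the session) when it returns True.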
<label for="gb-email" class="nah" aria-hidden="true">Email:</label>
<input id="gb-email"
name="email"
size="40"
class="nah"
tabindex="-1"
aria-hidden="true"
autocomplete="off"
>
With this CSS:
.nah {
opacity: 0;
position: absolute;
top: 0;
left: 0;
height: 0;
width: 0;
z-index: -1;
}
And any form submission with a value set for the email is blocked. It stopped 100% of the spam I was getting.
RIP screen reader users?
This is the main reason I haven't installed zip bombs on my website already -- on the off chance I'd make someone angry and end up having to fend off a DDoS.
Currently I have some URL patterns to which I'll return 418 with no content, just to save network / processing time (since if a real user encounters a 404 legitimately, I want it to have a nice webpage for them to look at).
Should probably figure out how to wire that into fail2ban or something, but not a priority at the moment.
I don't think it's a terrible problem to solve these days, especially if you use one of the tarpitting implementations that use nftables/iptables/eBPF, but if you have one of those annoying Chinese bot farms with thousands of IP addresses hitting your server in turn (Huawei likes to do this), you may need to think twice before deploying this solution.
(Or rather, the tarpit should be programmed to do this, whether by having a maximum resource allocation or monitoring free system resources.)
The gzip bomb means you serve 10MB but they try to consume vast quantities of RAM on their end and likely crash. Much better ratio.
[0]: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
Most of the bots I've come across are fairly dumb however, and those are pretty easy to detect & block. I usually use CrowdSec (https://www.crowdsec.net/), and with it you also get to ban the IPs that misbehave on all the other servers that use it before they come to yours. I've also tried turnstile for web pages (https://www.cloudflare.com/application-services/products/tur...) and it seems to work, though I imagine most such products would, as again most bots tend to be fairly dumb.
I'd personally hesitate to do something like serving a zip bomb since it would probably cost the bot farm(s) less than it would cost me, and just banning the IP I feel would serve me better than trying to play with it, especially if I know it's misbehaving.
Edit: Of course, the author could state that the satisfaction of seeing an IP 'go quiet' for a bit is priceless - no arguing against that
ln -s /dev/zero index.html
on my home page as a joke. Browsers at the time didn’t like that, they basically froze, sometimes taking the client system down with them. Later on, browsers started to check for actual content I think, and would abort such requests.
https://medium.com/@bishr_tabbaa/when-smart-ships-divide-by-...
"On 21 September 1997, the USS Yorktown halted for almost three hours during training maneuvers off the coast of Cape Charles, Virginia due to a divide-by-zero error in a database application that propagated throughout the ship’s control systems."
" technician tried to digitally calibrate and reset the fuel valve by entering a 0 value for one of the valve’s component properties into the SMCS Remote Database Manager (RDM)"
I think this was it:
https://freedomhacker.net/annoying-favicon-crash-bug-firefox...
Years later I was finally able to open it.
Among things that didn't work were qutebrowser, icecat, nsxiv, feh, imv, mpv. I did worry at first the file was corrupt, I was redownloading it, comparing hashes with a friend, etc. Makes for an interesting benchmark, I guess.
For others curious, here's the file: https://0x0.st/82Ap.png
I'd say just curl/wget it, don't expect it to load in a browser.
Old school acdsee would have been fine too.
I think it's all the pixel processing on the modern image viewers (or they're just using system web views that aren't 100% just a straight render).
I suspect that the more native renderers are doing some extra magic here. Or just being significantly more OK with using up all your ram.
It also pans and zooms swiftly
Partially zoomed in was fine, but zooming to maximum fidelity resulted in the tab crashing (it was completely responsive until the crash). Looks like Safari does some pretty smart progressive rendering, but forcing it to render the image at full resolution (by zooming in) causes the render to get OOMed or similar.
Preview on a mac handles the file fine.
Pan&zoom works instantly with a blurry preview and then takes another 5-10s to render completely.
I suggested trying the HN-beloved Sumatra PDF. Ugh, it couldn't cope with it normally. Chrome coped better.
Takes a few seconds, but otherwise seems pretty ok in desktop Safari. Preview.app also handles it fine (albeit does allocate an extra ~1-2GB of RAM)
Surprisingly, Windows 95 didn't die trying to load it, but quite a lot of operations in the system took noticeably longer than they normally did.
Any ideas?
There are other techniques. For example: hold a connection open and only push out a few bytes every few seconds. Whether that's cheap for you or not depends on your server's concurrency model (if it's 1 OS thread per connection, then you'd DoS yourself with this, but with an evented model you should be good). If the bot analyzes images or PDFs you could try toxic files that exploit known weaknesses which lead to memory corruption to crash them; that depends on the bot's capabilities and the libraries it uses, of course.
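A rough sketch of the slow-drip idea with an evented model, using Python's asyncio. The names `drip` and `start_tarpit`, the default port, and the 5-second delay are all assumptions for illustration.

```python
import asyncio

async def drip(writer: asyncio.StreamWriter, payload: bytes, delay: float) -> None:
    """Send `payload` one byte at a time, sleeping `delay` seconds between bytes."""
    try:
        for i in range(len(payload)):
            writer.write(payload[i:i + 1])
            await writer.drain()
            # Sleeping yields to the event loop, so a waiting client costs no thread.
            await asyncio.sleep(delay)
    except ConnectionError:
        pass  # the client gave up; nothing more to do
    finally:
        writer.close()

async def start_tarpit(payload: bytes, delay: float = 5.0,
                       host: str = "127.0.0.1", port: int = 8080):
    """Start a server that drips `payload` to every client; returns the server object."""
    async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        await drip(writer, payload, delay)
    return await asyncio.start_server(handle, host, port)
```

To run it, await `start_tarpit(...)` inside a coroutine and then await the returned server's `serve_forever()`. Each held connection still consumes a socket and a little memory, so a cap on concurrent clients is worth adding before exposing something like this.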
yes "<div>"|dd bs=1M count=10240 iflag=fullblock|gzip | pv > zipdiv.gz
Resulting file is about 15 MiB long and uncompresses into a 10 GiB monstrosity containing 1789569706 unclosed nested divs
Also you can reverse many DoS vectors depending on how you are set up and costs. For example reverse Slowloris attack and use up their connections.
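For anyone without pv or dd at hand, a roughly equivalent generator can be sketched in Python. The function name `write_div_bomb` and its parameters are made up here; unlike the dd pipeline it only writes whole lines, so the output can be a few bytes shorter than the requested size.

```python
import gzip

def write_div_bomb(path: str, total_bytes: int, chunk_lines: int = 65536) -> int:
    """Write a gzip file expanding to about `total_bytes` of "<div>" lines.

    Returns the number of <div> lines written.
    """
    line = b"<div>\n"  # same 6-byte line that `yes "<div>"` emits
    count = total_bytes // len(line)
    block = line * chunk_lines
    full, rest = divmod(count, chunk_lines)
    with gzip.open(path, "wb") as out:
        # Stream in blocks so the uncompressed data never sits in memory at once.
        for _ in range(full):
            out.write(block)
        out.write(line * rest)
    return count
```

Calling `write_div_bomb("zipdiv.gz", 10 * 1024**3)` would correspond to the 10 GiB case above, and will take a while to run.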
I am not sure how that could’ve worked. Unless the real /dev tree was exposed to your webserver’s chroot environment, this would’ve given nothing special except “file not found”.
The whole point of chroot for a webserver was to shield clients from accessing special files like that!
Even if you knew it was done with a symlink you don't know that - these days odds are it'd run in a container or vm, and so having access to /dev/zero means very little.
Write an ordinary static html page and fill a <p> with infinite random data using <!--#include file="/dev/random"-->.
or would that crash the server?
Ok, not a real zip bomb, for that we would need a kernel module.
Or a userland fusefs program, nice funky idea actually (with configurable dynamic filenames, e.g. `mnt/10GiB_zeropattern.zip`...
Like, a legitimate crawler suing you and alleging that you broke something of theirs?
I'll play the side of the defender and you can play the "bot"/bot deployer.
> it could serve the zip bomb to a legitimate bot.
Can you define the difference between a legitimate bot, and a non legitimate bot for me ?
The OP didn't mention it, but if we can assume they have SOME form of robots.txt (safe assumption given their history), would those bots who ignored the robots be considered legitimate/non-legitimate ?
Almost final question, and I know we're not lawyers here, but is there any precedent in case law or anywhere, which defines a 'bad bot' in the eyes of the law ?
Final final question, as a bot, do you believe you have a right or a privilege to scrape a website ?
Well by default every bot is legitimate, an illegitimate bot might be one that’s probing for security vulnerabilities (but I’m not even sure if that’s illegal if you don’t damage the server as a side effect, ie if you only try to determine the Wordpress or SSHD version running on the server for example).
> The OP didn't mention it, but if we can assume they have SOME form of robots.txt (safe assumption given their history), would those bots who ignored the robots be considered legitimate/non-legitimate ?
robots.txt isn’t legally binding so I don’t think ignoring it makes a bot illegitimate.
> Almost final question, and I know we're not lawyers here, but is there any precedent in case law or anywhere, which defines a 'bad bot' in the eyes of the law ?
There might be but I don’t know any.
> Final final question, as a bot, do you believe you have a right or a privilege to scrape a website ?
Well I’m not a bot but I think I have the right to build bots to scrape websites (and not get served malicious content designed to sabotage my computer). You can decline service and just serve error pages of course if you don’t like my bot.
https://en.wikipedia.org/wiki/Mantrap_(snare)
Of course their computers will live, but if you accidentally take down your own ISP or maybe some third-party service that you use for something, I'd think they would sue you.
>Disallow: /zipbomb.html
Legitimate crawlers would skip it this way; only scum ignores robots.txt.
The server owner can make an easy case to the jury that it is a booby trap to defend against trespassers.
I don't know of any online cases, but the law in many (most?) places certainly tends to look unfavourably on physical booby-traps. Even in the US states with full-on “stand your ground” legislation and the UK where common law allows for all “reasonable force” in self-defence, booby-traps are usually not considered self-defence or standing ground. Essentially if it can go off automatically rather than being actioned by a person in a defensive action, it isn't self-defence.
> Who […] is going to prosecute/sue the server owner?
Likely none of them. They might though take tit-for-tat action and pull that zipbomb repeatedly to eat your bandwidth, and they likely have more and much cheaper bandwidth than your little site. Best have some technical defences ready for that, as you aren't going to sue them either: they are probably running from a completely different legal jurisdiction and/or the attack will come from a botnet with little or no evidence trail wrt who kicked it off.
> pull that zipbomb repeatedly to eat your bandwidth, and they likely have more and much cheaper bandwidth than your little site.
Go read what a zip bomb is. There is one that is only a few KB, which is comparable in server load + bandwidth to a robots.txt.
No need to be a dick. Especially when you yourself are in the process of not understanding what others are saying.
I know full well what a zipbomb is. A large compressed file still has some size even in compressed form (without nesting, 1G of minimal entropy data is ~1M gzipped). If someone has noticed your bomb and worked around it by implementing relevant checks (or isn't really affected by it because of already having had those checks in place), they can get a little revenge by soaking up your bandwidth downloading it many times. OK, so nested, that comes down to a few KB; they can still throw a botnet at that, or some other content on your site, and cause you some faff, if they wish to engage in tit-for-tat action. Also: nesting doesn't work when you are using HTTP transport compression as your delivery mechanism, which is what is being discussed here: “standard” libraries supporting compressed HTTP encodings don't generally unpack nested content. There is no “Accept-Encoding: gzip+gzip” or similar.
Most, perhaps the vast majority, won't care to make the effort, so this could be considered a hypothetical, but some might. There were certainly cases, way back in my earlier days online, of junk mailers and address scrapers deliberately wasting bandwidth of sites that encouraged the use of tools like FormFucker or implemented scraper sinkholes.
Neither is the HTTP specification. Nothing is stopping you from running a Gopher server on TCP port 80, should you get into trouble if it happens to crash a particular crawler?
Making a HTTP request on a random server is like uttering a sentence to a random person in a city: some can be helpful, some may tell you to piss off and some might shank you. If you don't like the latter, then maybe don't go around screaming nonsense loudly to strangers in an unmarked area.
I just assumed court might say there is a difference between you requesting all guess-able endpoints and find 1 endpoint which will harm your computer (while there was _zero_ reason for you to access that page) and someone putting zipbomb into index.html to intentionally harm everyone.
That is not the case in this context. robots.txt is the only thing that specifies the document URL, which it does so in a "disallow" rule. The argument that they did not know the request would be responded to with hostility could be moot in that context (possibly because a "reasonable person" would have chosen not to request the disallowed document but I'm not really familiar with when that language applies).
> by deleting local files for example
This is a qualitatively different example than a zip bomb, as it is clearly destructive in a way that a zip bomb is not. True that a zip bomb could cause damage to a system but it's not a guarantee, while deleting files is necessarily damaging. Worse outcomes from a zip bomb might result in damages worthy of a lawsuit but the presumed intent (and ostensible result) of a zip bomb is to effectively cause the recipient machine to involuntarily shut down, which a court may or may not see as legitimate given the surrounding context.
The CFAA[1] prohibits:
> knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;
As far as I can tell (again, IANAL) there isn't an exception if you believe said computer is actively attempting to abuse your system[2]. I'm not sure if a zip bomb would constitute intentional damage, but it is at least close enough to the line that I wouldn't feel comfortable risking it.
[1]: https://www.law.cornell.edu/uscode/text/18/1030
[2]: And of course, you might make a mistake and incorrectly serve this to legitimate traffic.
> which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States
Assuming the server is running in the states, I think that would apply unless the client is in the same state as the server, in which case there is probably similar state law that comes into effect. I don't see anything there that excludes a client, and that makes sense, because otherwise it wouldn't prohibit having a site that tricks people into downloading malware.
Also, the protected computer has to be involved in commerce. Unless they are accessing the website with the zip bomb using a computer that is also used for interstate or foreign commerce, it won't qualify.
So what? It isn't in the section I quoted above. I could be wrong, but my reading is that transmitting information that can cause damage with the intent of causing damage is a violation, regardless of if you "access" another system.
> Also, the protected computer has to be involved in commerce
Or communication.
Now, from an ethics standpoint, I don't think there is anything wrong with returning a zipbomb to malicious bots. But I'm not confident enough that doing so is legal that I would risk doing so.
You can't read laws in sections like that. The sections go together. The entire law is about causing damage through malicious access. But servers don't access clients.
The section you quoted isn't relevant because the entire law is about clients accessing servers, not servers responding to clients.
In the US, virtually everything is involved in 'interstate commerce'. See https://en.wikipedia.org/wiki/Commerce_Clause
> The Commerce Clause is the source of federal drug prohibition laws under the Controlled Substances Act. In a 2005 medical marijuana case, Gonzales v. Raich, the U.S. Supreme Court rejected the argument that the ban on growing medical marijuana for personal use exceeded the powers of Congress under the Commerce Clause. Even if no goods were sold or transported across state lines, the Court found that there could be an indirect effect on interstate commerce and relied heavily on a New Deal case, Wickard v. Filburn, which held that the government may regulate personal cultivation and consumption of crops because the aggregate effect of individual consumption could have an indirect effect on interstate commerce.
In particular, the interstate commerce clause is very over-reaching. It's been ruled that someone who grew their own crops to feed to their own farm animals sold locally was conducting interstate commerce because they didn't have to buy them from another state.
IANAL
(I'm half-joking, half-crying. It's how everything else works, basically. Why would it not work here? You could even go as far as explicitly calling it a "zipbomb test delivery service". It's not your fault those bots have no understanding what they're connecting to…)
But if it matters pay your lawyer and if it doesn’t matter, it doesn’t matter.
I know it's slightly off topic, but it's just so amusing (edit: reassuring) to know I'm not the only one who, an hour after setting up WordPress, finds a PHP shell magically deployed on their server.
But it's such a bad platform that there really isn't any reason for anybody to use WordPress for anything. No matter your use case, there will be a better alternative to WordPress.
I've tried Drupal in the past for such situations, but it was too complicated for them. That was years ago, so maybe it's better now.
25 years ago we used Microsoft Frontpage for that, with the web root mapped to a file share that the non-technical secretary could write to and edit it as if it were a word processor.
Somehow I feel we have regressed from that simplicity, with nothing but hand waving to make up for it. This method was declared "obsolete" and ... Wordpress kludges took its place as somehow "better". Someone prove me wrong.
The other part is clients freaking out after Frontpage had a series of dangerous CVEs all in a row.
And then finally every time a part of Frontpage got popular, MS would deprecate the API and replace it with a new one.
Wordpress was in the right place at the right time.
In one, multiple users can login, edit WYSIWYG, preview, add images, etc, all from one UI. You can access it from any browser including smart phones and tablets.
In the other, you get to instruct users on git, how to deal with merge conflicts, code review (two people can't easily work on a post like they can in wordpress), previews require a manual build, you need a local checkout and local build installation to do the build. There's no WYSIWYG, adding images is a manual process of copying a file, figuring out the URL, etc... No smartphone/tablet support. etc....
I switched my blog from a WordPress install to a static site generator because I got tired of having to keep it up to date, but my posting dropped because the friction of posting went way up. I could no longer post from a phone. I couldn't easily add images. I had to build to preview. And I had to submit via git commits and pushes. All of that meant what was easy became tedious.
I build mine with GitHub Actions and host it free on Pages.
IIRC, Eleventy printed lots of out-of-date warnings when I installed it and/or the default style was broken in various ways which didn't give me much confidence.
My younger sister asked me to help her start a blog. I just pointed her to substack. Zero effort, easy for her.
For example (not affiliated with them) https://www.siteleaf.com/
Edit: I actually feel a bit sorry for the SurrealCMS developer. He has a fantastic product that should be an industry standard, but it's fairly unknown.
> new
Pretty sure Drupal has been around for like, 20 years or so. Or is this a different Drupal?
It appears Drupal CMS is a customized version of Drupal that is easier for less tech-savvy folks to get up and running. At least, that's the impression I got reading through the marketing hype that "explains" it with nothing but buzzwords.
And the only hosted option for the copyrighted code starts at 300/y
these don't cover any use case people use WordPress for.
- very hard to hack because we pre render all assets to a Cloudflare kv store
- public website and CMS editor are on different domains
Basically very hard to hack. As a bonus, it is also much more reliable, as it will only go down when Cloudflare does.
Could be automated better (drop ZIP to a share somewhere where it gets processed and deployed) but best of both worlds.
If they are selling anything on their website, it's probably going to be through a cloud hosted third party service and then it's just an embedded iframe on their website.
If you're making an entire web shop for a very large enterprise or something of similar magnitude, then you have to ask somebody else than me.
Everything I've built in the past like 5 years has been almost entirely pure ES6 with some helpers like jsviews.
https://survey.stackoverflow.co/2024/technology#1-web-framew...
And: compared to the other builders like Wix, Squarespace etc, you're not locked in. If you make a thing on wordpress.com or wordpress.org and want to escape, you just export your stuff in a common XML format. You get none of that with the commercial options.
So, yeh, however much HN likes to hate on it, it's still the best platform of choice for non-technicals to get stuff on the web.
Then WordPress is just your private CMS/UI for making changes, and it generates static files that are uploaded to a webhost like CloudFlare Pages, GitHub Pages, etc.
Now that plugin became a service, at which point you might just use a WP host and let them do their thing.
I think a crawler that generates a static directory from your site is probably the best approach since it generalizes over any site. Even better if you're able to declare all routes ahead of time.
>Oh look 3 separate php shells with random strings as a name
Never less than 3, but always guaranteed.
I've used this teaching folks devops, here deploy your first hello world nginx server... huh what are those strange requests in the log?
There's a few plugins that do this, but vanilla WP is dangerous.
https://www.hackerfactor.com/blog/index.php?/archives/762-At...
It's not working very well.
In the web server log, I can see that the bots are not downloading the whole ten megabyte poison pill.
They are cutting off at various lengths. I haven't seen anything fetch more than around 1.5 MB of it so far.
Or is it working? Are they decoding it on the fly as a stream, and then crashing? E.g. if something is recorded as having read 1.5 MB, could it have decoded it to 1.5 GB in RAM, on the fly, and crashed?
There is no way to tell.
PS: I'm on the bots side, but don't mind helping.
Anyway, from the bots' perspective labyrinths aren't the main problem. The Internet is being flooded with quality LLM-generated content.
I've noticed that LLM scrapers tend to be incredibly patient. They'll wait for minutes for even small amounts of text.
Secondly, I know that most of these bots do not come back. The attacks do not reuse addresses against the same server in order to evade almost any conceivable filter rule that is predicated on a prior visit.
> as soon as an IP address is logged as having visited the trap URL (honeypot, or zipbomb or whatever), a log monitoring script bans that client.
Is this not why they aren’t getting the full file?
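The log-monitoring step quoted above might be sketched like this, assuming an access log in the common "IP - - [timestamp] "METHOD path ..."" layout. The `/honeypot/` prefix, the regex, and the function name are assumptions, and actually applying the ban (firewall rule, fail2ban, and so on) is left out.

```python
import re

# Matches the start of a common-log-format line: client IP, timestamp, method, path.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)'
)

def trapped_ips(lines, trap_prefix: str = "/honeypot/") -> set:
    """Return the set of client IPs that requested any path under `trap_prefix`."""
    banned = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(trap_prefix):
            banned.add(match.group("ip"))
    return banned
```

If the ban is applied while the trap response is still being sent, the connection may be cut partway through, which would be consistent with partial downloads showing up in the log.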
124.243.178.242 - - [29/Apr/2025:00:16:52 -0700] "GET /cgit/[...]
94.74.94.113 - - [29/Apr/2025:00:07:01 -0700] "GET /honeypot/[...]
Notice the second timestamp is almost ten minutes earlier.
Many of these are annoying LLM training/scraping bots (in my case anyway). So while it might not crash them if you spit out a 800KB zipbomb, at least it will waste computing resources on their end.
I'm not a lawyer, but I'm yet to see a real-life court case of a bot owner suing a company or an individual for responding to his malicious request with a zip bomb. The usual spiel goes like this: responding to his malicious request with a malicious response makes you a cybercriminal and allows him (the real cybercriminal) to sue you. Again, except for cheap talk I've never heard of a single court case like this. But I can easily imagine them trying to blackmail someone with such cheap threats.
I cannot imagine a big company like Microsoft or Apple using zip bombs, but I fail to see why zip bombs would be considered bad in any way. Anyone with an experience of dealing with malicious bots knows the frustration and the amount of time and money they steal from businesses or individuals.
This is what trips me up:
>On my server, I've added a middleware that checks if the current request is malicious or not.
There's a lot of trust placed in:
>if (ipIsBlackListed() || isMalicious()) {
Can someone assigned a previously blacklisted IP or someone who uses a tool to archive the website that mimics a bot be served malware? Is the middleware good enough or "good enough so far"?
Close enough to 100% of my internet traffic flows through a VPN. I have been blacklisted by various services upon connecting to a VPN or switching servers on multiple occasions.
A user has to manually unpack a zip bomb, though. They have to open the file and see "uncompressed size: 999999999999999999999999999" and still try to uncompress it, at which point it's their fault when it fills up their drive and fails. So I don't think there's any ethical dilemma there.
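That "look at the uncompressed size first" step can also be automated. Below is a small sketch using Python's zipfile module; the function name and the thresholds are arbitrary choices for illustration. The sizes come from the archive's own headers, which are written by whoever built the archive, so this is only a first filter and not a replacement for limiting output while extracting.

```python
import zipfile

def declared_size_ok(path_or_file, max_total: int, max_ratio: float = 100.0) -> bool:
    """Check the sizes a zip archive declares before extracting anything."""
    with zipfile.ZipFile(path_or_file) as zf:
        infos = zf.infolist()
    total = sum(info.file_size for info in infos)       # declared uncompressed bytes
    packed = sum(info.compress_size for info in infos)  # bytes stored in the archive
    if total > max_total:
        return False
    if packed == 0:
        # Nothing stored: only acceptable if nothing is declared either.
        return total == 0
    return total / packed <= max_ratio
```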
The practical effect of this was you could place a zip bomb in an office xml document and this product would pass the ooxml file through even if it contained easily identifiable malware.
The file size problem is still an issue for many big name EDRs.
Scanning them is resource intensive. The choices are (1) skip scanning them; (2) treat them as malware; (3) scan them and be DoS'ed.
(Deferring the decision to a human is effectively DoS'ing your IT support team.)
I would have figured the process/server would restart, and restart with your specific URL since that was the last one not completed.
What makes the bots avoid this site in the future? Are they really smart enough to hard-code a rule to check for crashes and avoid those sites in the future?
Though, bots may not support modern compression standards. Then again, that may be a good way to block bots: every modern browser supports zstd, so just force that on non-whitelisted browser agents and you automatically confuse scrapers.
it is basically a quine.
How bad the tab process dying is depends on the browser. If your browser does site isolation well, it'll only crash that one website and you'll barely notice. If that process is shared between other tabs, you might lose state there. Chrome should be fine, Firefox might not be depending on your settings and how many tabs you have open, with Safari it kind of depends on how the tabs were opened and how the browser is configured. Safari doesn't support zstd though, so brotli bombs are the best you can do with that.
[1] checkboxes demo https://checkboxes.andersmurphy.com
[2] article on brotli SSE https://andersmurphy.com/2025/04/15/why-you-should-use-brotl...
But a gzip decompressor is not turing-complete, and there are no gzip streams that will expand to infinitely large outputs, so it is theoretically possible to find the pseudo-Kolmogorov-Complexity of a string for a given decompressor program by the following algorithm:
Let file.bin be a file containing the input byte sequence.
1. BOUNDS=$(gzip --best -c file.bin | wc -c)
2. LENGTH=1
3. If LENGTH==BOUNDS, run `gzip --best -c file.bin > test.bin.gz` and HALT.
4. Generate a file `test.bin.gz` LENGTH bytes long containing all zero bits.
5. Run `gunzip -k test.bin.gz`.
6. If `test.bin` equals `file.bin`, halt.
7. If `test.bin.gz` contains only 1 bits, increment LENGTH and GOTO 3.
8. Replace test.bin.gz with its lexicographic successor by interpreting it as a LENGTH-byte unsigned integer and incrementing it by 1.
9. GOTO 5.
test.bin.gz contains your minimal gzip encoding.
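A toy version of that search, as a sketch rather than the exact procedure above: it assumes raw DEFLATE through Python's `zlib` module instead of the `gzip` CLI (the gzip header and trailer alone add 18 bytes, which already puts brute force out of reach), and the names `inflate_exact` and `shortest_raw_deflate` are made up for illustration. It enumerates candidate streams in lexicographic order, shortest first, and stops at the first one that inflates to the target.

```python
import itertools
import zlib


def inflate_exact(stream: bytes, limit: int):
    """Inflate a raw DEFLATE stream; return None unless it is valid and complete."""
    d = zlib.decompressobj(-15)  # negative wbits = raw DEFLATE, no header or trailer
    try:
        out = d.decompress(stream, limit)  # cap the output so a bomb cannot stall the search
    except zlib.error:
        return None
    # Only accept streams that reached the final block with no leftover bytes.
    return out if d.eof and not d.unused_data else None


def shortest_raw_deflate(target: bytes, max_len: int):
    """Return the first (shortest, then lexicographically smallest) stream for target."""
    for length in range(1, max_len + 1):
        # itertools.product yields byte tuples in lexicographic order.
        for candidate in itertools.product(range(256), repeat=length):
            stream = bytes(candidate)
            if inflate_exact(stream, len(target) + 1) == target:
                return stream
    return None  # nothing found within max_len bytes


print(shortest_raw_deflate(b"", 2))
```

Each extra byte multiplies the candidate count by 256, so even a three-byte search means roughly 16.7 million inflate attempts, which is the intractability in miniature.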
There are "stronger" compressors for popular compression libraries like zlib that outperform the "best" options available, but none of them are this exhaustive because you can surely see how the problem rapidly becomes intractable.
For the purposes of generating an efficient zip bomb, though, it doesn't really matter what the exact contents of the output file are. If your goal is simply to get the best compression ratio, you could enumerate all possible files with that algorithm (up to the bounds established by compressing all zeroes to reach your target decompressed size, which makes a good starting point) and then just check for a decompressed length that meets or exceeds the target size.
I think I'll do that. I'll leave it running for a couple days and see if I can generate a neat zip bomb that beats compressing a stream of zeroes. I'm expecting the answer is "no, the search space is far too large."
I would need to selectively generate grammatically valid zstd streams for this to be tractable at all.
All depends on how much magic you want to shove into an "algorithm"
Signed, a kid in the 90s who downloaded some "wavelet compression" program from a BBS because it promised to compress all his WaReZ even more so he could then fit moar on his disk. He ran the compressor and hey golly that 500MB ISO fit into only 10MB of disk now! He found out later (after a defrag) that the "compressor" was just hiding data in unused disk sectors and storing references to them. He then learned about Shannon entropy from comp.compression.research and was enlightened.
So you could access the files until you wrote more data to disk?
Brings to mind this 30+ year old IOCCC entry for compressing C code by storing the code in the file names.
0-253: output the input byte
254 followed by 0: output 254
254 followed by 1: output 255
255: output 10GB of zeroes
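A minimal sketch of a decoder for that made-up format, assuming "10GB" means 10 × 1000³ bytes. `toy_decode`, `zero_run` and `chunk` are hypothetical names, and the output is streamed in chunks so the decoder itself never holds the whole run in memory.

```python
from typing import Iterator

ZERO_RUN = 10 * 1000**3  # "10GB of zeroes", assumed to be decimal gigabytes


def toy_decode(data: bytes, zero_run: int = ZERO_RUN, chunk: int = 1 << 20) -> Iterator[bytes]:
    """Decode the toy format, yielding output piece by piece."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 253:
            # 0-253: output the input byte unchanged
            yield bytes([b])
            i += 1
        elif b == 254:
            # 254 is an escape: the next byte selects a literal 254 or 255
            if i + 1 >= len(data) or data[i + 1] not in (0, 1):
                raise ValueError("bad escape sequence at offset %d" % i)
            yield bytes([254 + data[i + 1]])
            i += 2
        else:
            # 255: emit the long run of zeroes in bounded chunks
            remaining = zero_run
            while remaining > 0:
                n = min(chunk, remaining)
                yield bytes(n)
                remaining -= n
            i += 1


# Small demonstration with a 4-byte run instead of 10 GB.
print(b"".join(toy_decode(bytes([65, 254, 0, 254, 1, 255]), zero_run=4)))
```

With the default `zero_run`, a single input byte of 255 expands to ten billion output bytes, while any input that contains 254 or 255 as data grows by one byte per occurrence.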
Of course this is an artificial example, but theoretically it's perfectly sound. In fact, I think you could get there with static huffman trees supported by some formats, including gzip.
126, surely?
I tried this on my computer with a couple of other tools, after creating a file full of 0s as per the article.
gzip -9 turns it into 10,436,266 bytes in approx 1 minute.
xz -9 turns it into 1,568,052 bytes in approx 4 minutes.
bzip2 -9 turns it into 7,506 (!) bytes in approx 5 minutes.
I think OP should consider getting bzip2 on the case. 2 TBytes of 0s should compress nicely. And I'm long overdue an upgrade to my laptop... you probably won't be waiting long for the result on anything modern.
As far as I can tell, the biggest amplification you can get out of zstd is 32768 times: per the standard, the maximum decompressed block size is 128KiB, and the smallest compressed block is a 3-byte header followed by a 1-byte block (e.g. run-length-encoded). Indeed, compressing a 1GiB file of zeroes yields 32.9KiB of output, which is quite close to that theoretical maximum.
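The arithmetic behind that bound, as a quick check. It uses only the figures quoted above (128 KiB per block, 3-byte header plus 1 byte of content) and ignores the frame header and any checksum; the constant and function names are illustrative.

```python
MAX_BLOCK_BYTES = 128 * 1024     # maximum decompressed size of one block
MIN_BLOCK_BYTES = 3 + 1          # 3-byte block header + 1-byte RLE payload

MAX_RATIO = MAX_BLOCK_BYTES // MIN_BLOCK_BYTES


def min_compressed_bytes(decompressed_bytes: int) -> int:
    """Lower bound on compressed size, counting block bytes only."""
    return -(-decompressed_bytes // MAX_RATIO)  # ceiling division


print(MAX_RATIO)                        # amplification per block
print(min_compressed_bytes(2**30))      # lower bound for 1 GiB of zeroes
```

For 1 GiB the bound comes out at 32768 bytes, i.e. 32 KiB, so the observed 32.9 KiB is within about 3% of it.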
Brotli promises to allow for blocks that decompress up to 16 MiB, so that actually can exceed the compression ratios that bzip2 gives you on that particular input. Compressing that same 1 GiB file with `brotli -9` gives an 809-byte output. If I instead opt for a 16 GiB file (dd if=/dev/zero of=/dev/stdout bs=4M count=4096 | brotli -9 -o zeroes.br), the corresponding output is 12929 bytes, for a compression ratio of about 1.3 million; theoretically this should be able to scale another 2x, but whether that actually plays out in practice is a different matter.
(The best compression for brotli should be available at -q 11, which is the default, but it's substantially slower to compress compared to `brotli -9`. I haven't worked out exactly what the theoretical compression ratio upper bound is for brotli, but it's somewhere between 1.3 and 2.8 million.)
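For reference, the two `brotli -9` measurements above work out to nearly the same ratio. A small check using only the sizes quoted in this comment (`compression_ratio` is just a helper name for the sketch):

```python
GIB = 2**30


def compression_ratio(decompressed_bytes: int, compressed_bytes: int) -> float:
    """Decompressed size divided by compressed size."""
    return decompressed_bytes / compressed_bytes


print(round(compression_ratio(1 * GIB, 809)))      # 1 GiB  -> 809 bytes
print(round(compression_ratio(16 * GIB, 12929)))   # 16 GiB -> 12929 bytes
```

Both land at roughly 1.33 million, which suggests the ratio is already flat at these sizes for `-9`.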
Also note that zstd provides very good compression ratios for its speed, so in practice most use cases benefit from using zstd.
There's literally no machine on Earth today that can deal with that (as a single file, I mean).
Oh? Certainly not in RAM, but 4 PiB is about 125x 36TB drives (or 188x 24TB drives). (You can go bigger if you want to shell out tens of thousands per 100TB SSD, at which point you "only" need about 45 of those drives.)
These are numbers such that a purpose-built server with enough SAS expanders could easily fit that within a single rack, for less than $100k (based on the list price of an Exos X24 before even considering any bulk discounts).
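A quick check of the drive counts, assuming the 4 PiB is binary (4 × 2^50 bytes) and the drive capacities are decimal terabytes as marketed; under that reading the quotients come out at roughly 125, 188 and 45. `drives_needed` is a hypothetical helper.

```python
PIB = 2**50   # pebibyte, binary
TB = 10**12   # terabyte, decimal (how drive capacities are marketed)


def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Whole drives required to hold total_bytes, with no redundancy or filesystem overhead."""
    return -(-total_bytes // drive_bytes)  # ceiling division


for size_tb in (36, 24, 100):
    exact = 4 * PIB / (size_tb * TB)
    print(f"{size_tb} TB drives: {exact:.2f} exact, {drives_needed(4 * PIB, size_tb * TB)} whole drives")
```

Rounding up to whole drives adds one to the 36 TB and 100 TB figures, and any RAID or filesystem overhead would push the counts higher still.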
42.zip has five layers. But you can make a zip file that has an infinite number of layers. See https://research.swtch.com/zip or https://alf.nu/ZipQuine
Surely, the device does crash but it isn’t destroyed?
And to heck with cloudflare :S We don't need 3 companies controlling every part of the internet.
[0] https://www.bamsoftware.com/hacks/zipbomb/ [1] https://www.bamsoftware.com/hacks/zipbomb/#safebrowsing
I tried to contact the admin of the box (yeah that’s what people used to do) and got nowhere. Eventually I sent a message saying “hey I see your machine trying to connect every few seconds on port <whatever it is>. I’m just sending a heads up that we’re starting a new service on that port and I want to make sure it doesn’t cause you any problems.”
Of course I didn’t hear back. Then I set up a server on that port that basically read from /dev/urandom, set TCP_NODELAY and a few other flags and pushed out random gibberish as fast as possible. I figured the clients of this service might not want their strings of randomness to be null-terminated so I thoughtfully removed any nulls that might otherwise naturally occur. The misconfigured NT box connected, drank 5 seconds or so worth of randomness, then disappeared. Then 5 minutes later, reappeared, connected, took its buffer overflow medicine and disappeared again. And this pattern then continued for a few weeks until the box disappeared from the internet completely.
I like to imagine that some admin was just sitting there scratching his head wondering why his NT box kept rebooting.
Pretty neat.
I went down a crazy rabbit hole and found a bunch of domains that were random parts of street addresses. Obviously created automatically and they were purposely trying to make it harder to find related domains.
You can also limit the wider process or system your request is part of.
Time limits also tend to de facto limit size, if bandwidth is somewhat constrained.
Timeouts and size limits are trivial to update as legitimate need is discovered.
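One way to combine both limits, sketched with Python's standard `zlib` module. The exception class, function name and parameters are made up for illustration, and the limits passed in the demo are arbitrary. The stream is inflated in bounded steps, and the output size and wall-clock deadline are checked after every step.

```python
import gzip
import time
import zlib


class DecompressionLimitExceeded(Exception):
    """Raised when the output grows past the size cap or the deadline passes."""


def bounded_gunzip(chunks, max_bytes: int, max_seconds: float, step: int = 64 * 1024) -> bytes:
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects the gzip wrapper
    deadline = time.monotonic() + max_seconds
    out = bytearray()
    for chunk in chunks:
        data = chunk
        while data:
            out += d.decompress(data, step)  # never produce more than `step` bytes per call
            if len(out) > max_bytes:
                raise DecompressionLimitExceeded("output exceeds %d bytes" % max_bytes)
            if time.monotonic() > deadline:
                raise DecompressionLimitExceeded("took longer than %s s" % max_seconds)
            data = d.unconsumed_tail  # input zlib has not processed yet
    out += d.flush()  # collect anything still buffered
    if len(out) > max_bytes:
        raise DecompressionLimitExceeded("output exceeds %d bytes" % max_bytes)
    return bytes(out)


# Demo: 16 MiB of zeroes compresses to a few KiB and is rejected by a 1 MiB cap.
bomb = gzip.compress(bytes(16 * 1024 * 1024))
try:
    bounded_gunzip([bomb], max_bytes=1024 * 1024, max_seconds=5.0)
except DecompressionLimitExceeded as exc:
    print("rejected:", exc)
```

Memory use stays at roughly `max_bytes + step` no matter how large the stream claims to be, and raising either limit later is a one-argument change.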
Practically speaking, putting an arbitrary size limit somewhere is like putting yet-another-ssl-cert-that-needs-to-be-renewed in some critical system. It will eventually cause an outage you aren’t expecting.
Will there be a plausible someone to blame? Of course. Realistically, it was also inevitable someone would forget and run right into it.
Time limits tend to not have this issue, for various reasons.
I found a fix for this some years back:
openssl req -x509 -days 36500
No, not at all. A TLS cert that expires takes the whole thing down for everyone. A size limit takes one operation down for one user.
E.g. if you fork for every request, that process only serves that one user. Or if you can restart fast enough.
I'm mostly inspired by Erlang here.
I had a lazy down-detection fix on my RPi server at home: it pinged a domain I owned, and if it couldn't reach it, it assumed it wasn't connected to a network and rebooted itself. I let the domain lapse and the RPi kept going down every 5 minutes or so... I thought it was a power fault, then I remembered that cron job.
Disclaimer: I put a lot of servers on the Internet in the 90’s/early 2000’s. It was industry-wide standard practice: ‘use NT so you don’t need an admin’.
Rather not write it myself
Here's a start hacked together and tested on my phone:
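perl -lnE 'if (m{GET (/[^ ]*)}) { my $p = $1;
# BASE_URL is a placeholder; $p already starts with "/"
my $s = qx(curl -sI "https://BASE_URL$p" | head -n 1);
# print paths whose status line is neither 200 nor 302
unless ($s =~ / (?:200|302)\b/) {
say $p
}
}'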
https://github.com/danielmiessler/SecLists/blob/master/Disco...
I combined a few of the most interesting lists from here into one and never miss an attack now
10T is probably overkill though.
Think about it:
$ dd if=/dev/zero bs=1 count=10M | gzip -9 > 10M.gzip
$ ls -sh 10M.gzip
12K 10M.gzip
Other than that, why serve gzip anyway? I would not set the Content-Length Header and throttle the connection and set the MIME type to something random, hell just octet-stream, and redirect to '/dev/random'.
I don't get the 'zip bomb' concept, all you are doing is compressing zeros. Why not compress '/dev/random'? You'll get a much larger file, and if the bot receives it, it'll have a lot more CPU cycles to churn.
Even the OP article states that after creating the '10GB.gzip' that 'The resulting file is 10MB in this case.'.
Is it because it sounds big?
Here is how you don't waste time with 'zip bombs':
$ time dd if=/dev/zero bs=1 count=10M | gzip -9 > 10M.gzip
10485760+0 records in
10485760+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 9.46271 s, 1.1 MB/s
real 0m9.467s
user 0m2.417s
sys 0m14.887s
$ ls -sh 10M.gzip
12K 10M.gzip
$ time dd if=/dev/random bs=1 count=10M | gzip -9 > 10M.gzip
10485760+0 records in
10485760+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 12.5784 s, 834 kB/s
real 0m12.584s
user 0m3.190s
sys 0m18.021s
$ ls -sh 10M.gzip
11M 10M.gzip
The compression ratio is the whole point... if you can send something small for next to no $$ which causes the receiver to crash due to RAM, storage, compute, etc constraints, you win.
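To put numbers on that, here is a small comparison using Python's `zlib` (the same DEFLATE algorithm gzip uses, minus the gzip wrapper). The 10 MiB size mirrors the transcripts above; `deflate_ratio` is a name chosen for this sketch.

```python
import os
import zlib

SIZE = 10 * 1024 * 1024  # 10 MiB, as in the dd examples above


def deflate_ratio(payload: bytes) -> float:
    """Bytes the client must inflate per byte the server has to send."""
    return len(payload) / len(zlib.compress(payload, 9))


print("zeroes:", round(deflate_ratio(bytes(SIZE))))
print("random:", round(deflate_ratio(os.urandom(SIZE)), 3))
```

Zeroes come out around a thousand to one, while random data stays at about one to one, so with random data the sender pays for every byte the receiver has to handle.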
>>> from requests import get
>>> r = get("https://acme.tld/trap/")
>>> r.text
The server doesn't do much (serving a relatively small number of bytes) while the client basically crashes.
How accurate is that middleware? Obviously there are false negatives as you supplement with other heuristics. What about false positives? Just collateral damage?
> A well-optimized, lightweight setup beats expensive infrastructure. With proper caching, a $6/month server can withstand tens of thousands of hits — no need for Kubernetes.
----
[1] Though doing this in order to play/learn/practise is, of course, understandable.
For all those "eagerly" fishing for content AI bots I ponder if I should set up a Markov chain to generate semi-legible text in the style of the classic https://en.wikipedia.org/wiki/Mark_V._Shaney ...
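A minimal sketch of such a generator, assuming a first-order, word-level chain for brevity. The function names and the tiny corpus are placeholders; a real deployment would train on a much larger text.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict, n_words: int, seed=None) -> str:
    """Random walk over the chain, restarting at a random word on dead ends."""
    rng = random.Random(seed)
    starts = sorted(chain)  # sorted so a given seed is reproducible
    word = rng.choice(starts)
    out = [word]
    while len(out) < n_words:
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


corpus = "the bot reads the page and the page feeds the bot more pages to read"
print(babble(build_chain(corpus), 12, seed=1))
```

The corpus needs at least two words, otherwise the chain is empty and there is nothing to walk.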
You need that to protect against not only these types of shenanigans, but also large or slow responses.
https://blog.haschek.at/2017/how-to-defend-your-website-with...
codingdave•5mo ago
cratermoon•5mo ago
xena•5mo ago
cookiengineer•5mo ago
wiredfool•5mo ago
Amazon's scraper doesn't back off. Meta, Google, and most of the others with identifiable user agents back off; Amazon doesn't.
toast0•5mo ago
cratermoon•5mo ago
toast0•5mo ago
tcpdrop shouldn't self-DoS though; it uses fewer resources. Even if the other end retries, it will do so after a timeout; in the meantime, the other end has socket state and you don't, which is a win.
deathanatos•5mo ago
(I don't think your blog qualifies as shady … but you're not in my allowlist, either.)
So if I visit https://anubis.techaro.lol/ (from the "Anubis" link), I get an infinite anime cat girl refresh loop — which honestly isn't the worst thing ever?
But if I go to https://xeiaso.net/blog/2025/anubis/ and click "To test Anubis, click here." … that one loads just fine.
Neither xeserv.us nor techaro.lol are in my allowlist. Curious that one seems to pass. IDK.
The blog post does have that lovely graph … but I suspect I'll loop around the "no cookie" loop in it, so the infinite cat girls are somewhat expected.
I was working on an extension that would store cookies very ephemerally for the more malicious instances of this, but I think its design would work here too. (In-RAM cookie jar, burns them after, say, 30s. Persisted long enough to load the page.)
lcnPylGDnU4H9OF•5mo ago
Is your browser passing a referrer?
cycomanic•5mo ago
I used cookie blockers for a long time, but always ended up having to whitelist some sites even though I didn't want their cookies because the site would misbehave without them. Now I just stopped worrying.
xena•5mo ago
theandrewbailey•5mo ago
codingdave•5mo ago
chmod775•5mo ago
"Hurting people is wrong, so you should not defend yourself when attacked."
"Imprisoning people is wrong, so we should not imprison thieves."
Also the modern telling of Robin Hood seems to be pretty generally celebrated.
Two wrongs may not make a right, but often enough a smaller wrong is the best recourse we have to avert a greater wrong.
The spirit of the proverb is referring to wrongs which are unrelated to one another, especially when using one to excuse another.
zdragnar•5mo ago
The logic of terrorists and war criminals everywhere.
BlackFingolfin•5mo ago
_Algernon_•5mo ago
Do you really want to live in a society where all use of punishment to discourage bad behaviour in others is off the table? That is a game-theoretical disaster...
toss1•5mo ago
Crime and Justice are not the same.
If you cannot figure that out, you ARE a major part of the problem.
Keep thinking until you figure it out for good.
impulsivepuppet•5mo ago
cantrecallmypwd•5mo ago
This is exactly what Californian educators told kids who were being bullied in the 90's.
imiric•5mo ago
bsimpson•5mo ago
They made the request. Respond accordingly.
joezydeco•5mo ago
https://williamgibson.fandom.com/wiki/ICE
gherard5555•5mo ago
petercooper•5mo ago