Make people sign up if they want a URL they can `curl`, and then either block or charge users who download too much.
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
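A minimal sketch of what that could look like, assuming downloads go through an application server and every signed-up user presents an API key (the quota figure, `check_quota`, and the key handling are all made up for illustration):

```python
# Hypothetical per-key daily download quota: block (or upsell) heavy users.
import time
from collections import defaultdict

DAILY_QUOTA_BYTES = 5 * 1024**3  # e.g. 5 GiB per key per day (made-up figure)

usage = defaultdict(lambda: {"day": None, "bytes": 0})

def check_quota(api_key: str, file_size: int) -> str:
    today = time.strftime("%Y-%m-%d")
    entry = usage[api_key]
    if entry["day"] != today:        # new day: reset the counter
        entry["day"], entry["bytes"] = today, 0
    if entry["bytes"] + file_size > DAILY_QUOTA_BYTES:
        return "block"               # or flip the account to a paid tier
    entry["bytes"] += file_size
    return "allow"
```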
For example, GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
E.g. my SQLite-based project downloads the code from the GitHub mirror rather than from Fossil.
That way, someone manually downloading the file is not impacted, but if you put the URL in a script it won’t work.
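One way to get that behaviour (an assumption on my part; the comment doesn't say how it is done) is to hand the browser a signed, short-lived URL, so a link copy-pasted into a script stops working once the token expires:

```python
# Sketch: HMAC-signed, expiring download links. The page generates a fresh
# link for each manual download; a hard-coded link in a script soon goes stale.
import hashlib, hmac, time

SECRET = b"server-side secret"  # hypothetical signing key

def make_url(path: str, ttl: int = 300) -> str:
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return f"https://example.org{path}?expires={expires}&sig={sig}"

def is_valid(path: str, expires: int, sig: str) -> bool:
    if time.time() > expires:    # link has expired
        return False
    expected = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```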
Although they refer to IP ranges, the same principle applies on a smaller scale to a single IP address: (1) dynamic IP addresses get reallocated, and (2) entire buildings (universities, libraries, hotels, etc.) might share a single IP address.
Aside from accidentally affecting innocent users, you also open up the possibility of a DoS attack: the attacker just has to abuse the service from an IP address they want to deny access to.
Also, everyone go contribute/donate to OSM.
Shapefiles shouldn't be what you're after; Parquet can almost always do a better job, unless you need to edit something or use really advanced geometry not yet supported in Parquet.
Also, this is your best source for bulk OSM data: https://tech.marksblogg.com/overture-dec-2024-update.html
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
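On the Parquet point, a minimal sketch of reading a GeoParquet file with GeoPandas (the file name is a placeholder, e.g. an Overture extract):

```python
import geopandas as gpd

# "buildings.parquet" is a hypothetical GeoParquet file; the call mirrors
# gpd.read_file("buildings.shp") for shapefiles.
gdf = gpd.read_parquet("buildings.parquet")
print(len(gdf), "features")
print(gdf.head())
```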
I've had the same thought for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally, and we could just share it.
1. BitTorrent has a bad rep. Most people still associate it with illegal downloads.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrows because of reason 1. On very restrictive networks, they might not want to allow it at all, since it opens the door for, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and not installed on most company computers / CI pipelines (for lack of need, and again reason 1). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scares the hell out of them.
Overall, I think it is mostly 1, plus the fact that you can just `curl` stuff and have everything working. It does sadden me that people do not understand how good a file-transfer protocol BT is and how underused it is. I do remember some video game clients using BT for updates under the hood, and PeerTube uses WebTorrent, but BT is sadly not very popular.
Some of the reasons consist of lawyers sending out costly cease-and-desist letters, even to "legitimate" users.
Obviously illegal ≠ immoral, and being a free-software/libre advocate opposed to copyright, I am in favor of the free sharing of humanity's knowledge, and therefore supportive of piracy, but that doesn't change the perception in a corporate environment.
I think it’s more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file-syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler, and cheaper.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting knows, the excessive number of connections means the overall quality of non-torrent traffic on the network goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a Squid cache box before, but HTTPS makes that very difficult, as you would have to install MITM certs on all devices.
For container images, yes, you have pull-through registries etc., but not only are these non-trivial to set up (as a service and for each client), the cloud providers also charge quite a lot for storage, making it difficult to justify when not having a cache "works just fine".
The Linux distros (and CPAN and TeX Live etc.) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but it is not really sustainable for the wide variety of content that would be cached, outside of video media or packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
Two eyeball networks (consumer/business ISPs) are unlikely to have large PNIs with each other across wide geographical areas to cover sudden bursts of traffic between them. They will, however, have substantial capacity to content networks (not just CDNs, but AWS/Google etc) which is what they will have built out.
BitTorrent turns fairly predictable "North/South" traffic, where capacity can be planned in advance and handed off "hot potato" as quickly as possible, into what is essentially "East/West" traffic with no clear consistency. That would cause massive congestion and/or unused capacity, as ISPs would have to carry it potentially over long distances they have not been used to, with no guarantee that the large flow will still exist in a few weeks' time.
If BitTorrent knew the network topology, it could act smarter. CDNs accept BGP feeds from carriers and ISPs so that they can steer traffic, but this isn't practical for BitTorrent!
It could surely be made to care about topology, but IMHO handing that problem to the congestion-control and routing mechanisms at lower layers works well enough and should not be a problem.
The level of irresponsibility/cluelessness you see from developers if you're hosting any kind of API is astonishing, so the downloads are not surprising at all... If someone had told me, a couple of years back, the things that I've now seen, I'd have dismissed them as making stuff up and grossly exaggerating...
However, by the same token, it's sometimes really surprising how rarely API developers think in terms of multiples of things. It's very often just endpoints that act on single entities, even if the nature of the use case is almost never at that level, so you have no other way than to send 700 requests to do "one action".
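A hypothetical illustration of the difference (endpoint names and payloads are made up): a single-entity endpoint forces clients into request storms, while a batch endpoint does the same work in one call.

```python
# Sketch with Flask: one route that acts on a single entity, one that
# accepts a list of IDs so clients don't need 700 requests for "one action".
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/items/<int:item_id>/archive")
def archive_one(item_id):
    # clients must call this once per item
    return jsonify(archived=[item_id])

@app.post("/items:archive")
def archive_many():
    ids = (request.get_json(silent=True) or {}).get("ids", [])
    # one request, however many items
    return jsonify(archived=ids)
```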
This applies to anyone unskilled in a profession. I can assure you, we're not all out here hammering the shit out of any API we find.
With the accessibility of programming to just about anybody, and particularly now with "vibe-coding", it's going to happen.
Slap a 429 (Too Many Requests) in your response or something similar using a leaky-bucket algo and the junior dev/apprentice/vibe coder will soon learn what they're doing wrong.
- A senior backend dev
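A minimal sketch of that idea (the per-IP keying, rate, and burst size are assumptions for illustration):

```python
# Leaky-bucket rate limiter that answers HTTP 429 once the bucket overflows.
import time
from collections import defaultdict
from flask import Flask, request

app = Flask(__name__)

RATE = 1.0       # bucket drains at 1 request per second
CAPACITY = 10.0  # allow bursts of up to 10 requests

buckets = defaultdict(lambda: {"level": 0.0, "last": time.monotonic()})

@app.get("/download")
def download():
    b = buckets[request.remote_addr]
    now = time.monotonic()
    b["level"] = max(0.0, b["level"] - (now - b["last"]) * RATE)  # leak
    b["last"] = now
    if b["level"] + 1 > CAPACITY:
        return "Too Many Requests", 429, {"Retry-After": "10"}
    b["level"] += 1
    return "here is your file"
```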
Then I learned about Docker.
It is free, yes, but there is no need to abuse it, nor for them to give away as many resources for free as they can.
Whenever I have done something like that, it's usually because I'm writing a script that goes something like:
1. Download file
2. Unzip file
3. Process file
I'm working on step 3, but I keep running the whole script because I haven't yet built a way to just do step 3.
I've never done anything quite that egregious though. And these days I tend to be better at avoiding this situation, though I still commit smaller versions of this crime.
edit: bad math, it's 1 download every 8 seconds
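The obvious fix (a sketch; the URL and file names are placeholders) is to cache steps 1 and 2 so re-running the script only redoes step 3:

```python
# Skip the download and the unzip on re-runs; only step 3 is repeated.
import os
import urllib.request
import zipfile

URL = "https://example.org/data.zip"  # placeholder
ZIP_PATH = "data.zip"
EXTRACT_DIR = "data"

def fetch_and_unzip():
    if not os.path.exists(ZIP_PATH):       # step 1, done once
        urllib.request.urlretrieve(URL, ZIP_PATH)
    if not os.path.isdir(EXTRACT_DIR):     # step 2, done once
        with zipfile.ZipFile(ZIP_PATH) as zf:
            zf.extractall(EXTRACT_DIR)

def process():                             # step 3, the part being iterated on
    for name in os.listdir(EXTRACT_DIR):
        print("processing", name)

if __name__ == "__main__":
    fetch_and_unzip()
    process()
```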
However, the practical evidence is to the contrary: AI companies are hammering every webserver out there, ignoring any kind of convention like robots.txt, and re-downloading everything at pointlessly short intervals. Annoying everyone and killing services.
Just a few recent examples from HN: https://news.ycombinator.com/item?id=45260793 https://news.ycombinator.com/item?id=45226206 https://news.ycombinator.com/item?id=45150919 https://news.ycombinator.com/item?id=42549624 https://news.ycombinator.com/item?id=43476337 https://news.ycombinator.com/item?id=35701565