They are both one line.
(Disclaimer, I work on duckdb-spatial @duckdblabs)
If I can reduce my spatial analysis to SQL and optimize that SQL a little bit, duckdb will happily saturate my CPUs and memory to process things fast in a way that a Python script struggles to do.
Yes, the static linking is nice. It helps with the problem of reproducible environments in Python. But that's not a game changer IMO.
Docker has been a big improvement (when I was first learning PostGIS, the amount of time I had to hunt for proj directories or compile software just to install the plugin was a major hurdle), but it's many steps away from:
```
$ duckdb
D INSTALL spatial;
```
Getting started from 0 with geo can be difficult for those unfamiliar. DuckDB packages everything into one line with one dependency.
No, it works like SQLite.
I have no major qualms with pandas and geopandas. However, I use them when they're the only practical solution, not because I enjoy using them as a library. It sounds like pandas (or similar) vs. a database?
If you're using JavaScript, you install Turf. The concept that you can readily install spatial libraries is hardly earth-shattering.
(Geopandas is great, too.)
Most of the time I store locations and compute distances to them. Would that be faster to implement with DuckDB?
In my use case, I use DuckDB because of speed at scale. I have 600GBs of lat-longs in Parquet files on disk.
If I wanted to use Postgis, I would have to ingest all this data into Postgres first.
With DuckDB, I can literally drop into a Jupyter notebook and do this in under 10 seconds, and the results come back in a flash (no need to ingest any data ahead of time):
```python
import duckdb

duckdb.query("INSTALL spatial; LOAD spatial;")
duckdb.query("""
    SELECT ST_DISTANCE(ST_POINT(lng1, lat1), ST_POINT(lng2, lat2)) AS dist
    FROM '/mydir/*.parquet'
""")
```
Also, as a side note: is everyone just using DuckDB in memory? Because as soon as you want some multi-session setup, I'd assume you'd use DuckDB on top of a local database file, so again I don't see the point, but I'm sure I'm missing something.
Usually new data is generated regularly and would require creating a separate ETL process to ingest into Postgres. With DuckDB, no ETL is needed. New Parquet files are just read off the disk.
> Also as a side note, is everyone just using DuckDB in memory?
DuckDB is generally used as a single-user tool, and yes, the in-memory use case is the most common. I'm not sure about use cases where a single user requires multiple sessions, but DuckDB does have read concurrency, session isolation, etc. I believe write serialization across multiple sessions is supported.
With Parquet files, it's append-only so the "write" use-cases tend to be more limited. Generally another process generates those Parquet files. DuckDB just works with them.
This part was not obvious. In a lot of cases geodata is mostly stable, and reads/searches dominate over appends. That's why we keep it in a DB (usually PostGIS, yes).
So DuckDB is optimised for a very different use case, and that's not always obvious when it's mentioned.
DuckDB also provides a vectorized, parallelized engine. When I run a query all of my 32 cores light up on htop.
If they are stored in DuckDB’s native format (which I don’t use), it supports some state of the art indices.
https://duckdb.org/docs/stable/sql/indexes.html
I find Parquet plenty fast though.
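For what it's worth, here's a rough sketch of that path (table, column, and file names are placeholders, and the RTREE index type comes from the spatial extension, so check your version's docs for the exact syntax):

```python
# Hypothetical sketch: load Parquet into DuckDB's native format and index it.
# Table/column names are placeholders; the RTREE index type is provided by
# the spatial extension and its syntax may differ between versions.
import duckdb

con = duckdb.connect("points.duckdb")
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("""
    CREATE TABLE points AS
    SELECT *, ST_Point(lng, lat) AS geom
    FROM '/mydir/*.parquet'
""")
con.sql("CREATE INDEX points_geom_idx ON points USING RTREE (geom)")
```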
Folks who have been doing DevOps work are exasperated with crummy SaaS vendors or antiquated OSS options that have a high setup cost. DuckDB is just a mature project that offers an alternative, hence an easy fan favorite among hobbyists (I imagine at scale the opportunity costs change and it becomes less attractive).
I'm still getting feedback that many devs are not too comfortable with reading and writing SQL. They learned simple SELECT statements in school, but get confused by JOINs and GROUP BYs.
I'd pick that before traveling the DuckDB path.
SQL is burned into my muscle memory because I work on ML models with large datasets. It took me a lot of trial and error to get decent at SQL. I imagine most devs just don't have the reps because their work rarely exposes them to SQL.
SQL queries are one area where correctness matters a lot, because downstream applications rely on them being correct. If you get a query wrong, especially in an ETL process, you can generate a lot of garbage data for a very long time (I'm speaking from lived experience). It might take a long time to correct (via backfill), and sometimes the original data might no longer be available.
I use LLMs to generate ideas for writing complex SQL, but I always need to understand the query 100% first before I deploy. Mistakes are costly, and sometimes irreversible. I need to trust but verify, which requires deep SQL knowledge.
To write correct SQL, you not only have to know the syntax, but also the schema (easily provided to an LLM) and the expected data values (you can provide a sampling to an LLM, which helps, but domain knowledge helps even more). There are lots of surprising things in data that only reveal themselves via visual inspection. (Though I think with MCP, an LLM will one day be able to inspect the data itself by sampling the database.)
If I already expect it to be making at least some of a mess, why not have AI as part of the setup?
So the operational savings are more for using it as a lightweight data science tool that excels in specific workloads like geospatial. Wherever you need to do those computations more locally, it will excel.
I don't really think SQL is a mandated DevOps skill, though a basic understanding of databases certainly is. Between online content and LLMs, as long as you can iteratively improve a query part by part through EXPLAIN ANALYZE etc., that'll be sufficient.
Yes, DuckDB does a whole lot more: vectorized larger-than-memory execution, columnar compressed storage, and an ecosystem of other extensions that make it more than the sum of its parts. But while I've been working hard on making the spatial extension more performant and more broadly useful (I designed a new geometry engine this year, and spatial join optimization just got merged on the dev branch), the fact that you can e.g. convert to and from a myriad of different geospatial formats by utilizing GDAL, transform through SQL, or pull down the latest Overture dump without having the whole workflow break just because you updated QGIS has probably been the main killer feature for a lot of the early adopters.
(Disclaimer, I work on duckdb-spatial @ duckdblabs)
I'm curious if it compresses them better or something like that. I see lots of people online saying it compresses well (but mostly compared to .shp or similar) but normal parquet (.gz.parquet or .snappy.parquet) already does that really well. So it's not clear to me if I should spend time investigating it...
I mostly process normal parquet with spark and sometimes clickhouse right now.
When using EPSG:4326 lat/lons, I don't think GeoParquet would give you any benefits over just having separate columns (this is what I typically do, and it's plenty fast).
If you are using range requests to fetch only parts of parquet files at a time, you could potentially sort your data using a hilbert curve, which could limit the number of row groups that need to be fetched to execute a query.
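A hedged sketch of that idea (duckdb-spatial's ST_Hilbert is assumed here with a global lon/lat bounding box; double-check the exact signature and bounds type for your version, and the paths/columns are placeholders):

```python
# Sort rows by a Hilbert value before writing, so spatially nearby rows land
# in the same Parquet row groups and range requests can skip more of the file.
# ST_Hilbert usage is an assumption; check the duckdb-spatial docs for the
# exact signature. Paths and column names are placeholders.
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("""
    COPY (
        SELECT *
        FROM '/mydir/*.parquet'
        ORDER BY ST_Hilbert(
            lng, lat,
            {'min_x': -180, 'min_y': -90, 'max_x': 180, 'max_y': 90}::BOX_2D
        )
    ) TO '/mydir/points_hilbert.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
""")
```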
It's just something the writer of the article should probably have at least mentioned when going full hyperbole with the title (hey, after reading this, it might actually be justified! :) ).
Rereading the article that focuses on the one thing that isn't a standout (the installation itself), though, I can't help but chuckle and think "worst sales pitch ever". ;)
And for the last twenty, not ten, years this is what PostGIS was pioneering, and also teaching everyone to get used to. DuckDB was not something that people in the GIS world even knew about. I'm not even sure whether QGIS connects to DuckDB; perhaps it has for a while, but it has sure known SpatiaLite for a very long time, and last but not least, ESRI sure as f*ck still hasn't heard of DuckDB. That is already half the geospatial world out there.
This whole article is superbly biased, and it's very sad.
Which does not change my original statement that nobody cared about DuckDB for a very long time, while the whole server-side processing idea is largely based on PostGIS.
Yeah, I feel old.
But what DuckDB as an engine does is it lets me work directly on parquet/geoparquet files at scale (vectorized and parallelized) on my local desktop. It beats geopandas in that respect. It's a quality of life improvement to say the least.
DuckDB also has an extension architecture that admits more exotic geospatial features like Hilbert curves, Uber H3 support.
https://duckdb.org/docs/stable/extensions/spatial/functions....
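As a small illustration of the H3 side, a sketch using the community h3 extension (function names follow the H3 v4 API, but treat the exact names and signatures as something to verify; paths and columns are placeholders):

```python
# Hypothetical sketch: bucket points into H3 cells and count per cell.
# Assumes the community 'h3' extension and its h3_latlng_to_cell function;
# verify names/signatures against that extension's docs. Paths are placeholders.
import duckdb

con = duckdb.connect()
con.sql("INSTALL h3 FROM community; LOAD h3;")
con.sql("""
    SELECT h3_latlng_to_cell(lat, lng, 8) AS cell, count(*) AS n
    FROM '/mydir/*.parquet'
    GROUP BY cell
    ORDER BY n DESC
    LIMIT 10
""").show()
```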
I've been doing the reprojection thing, projecting coordinates to a "local" CRS, for previous projects, mainly because that's what geopandas recommends and is built around, but I'm reaching a stage where I'd like to calculate distances for objects all over the globe, and I'm genuinely interested to learn what's a good choice here.
Geodesics are the most accurate (Vincenty etc) but are computationally heavy.
Haversine is a nice middle ground.
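For concreteness, a small sketch of both options, spherical haversine versus an ellipsoidal geodesic via pyproj (the coordinates are just an example):

```python
# Spherical haversine vs. ellipsoidal geodesic (pyproj/WGS84) for the same pair
# of points. Haversine is cheap and usually within ~0.5%; the geodesic is the
# accurate-but-heavier option mentioned above.
import math
from pyproj import Geod

def haversine_m(lon1, lat1, lon2, lat2, r=6371008.8):
    """Great-circle distance in meters on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

geod = Geod(ellps="WGS84")
_, _, geodesic_m = geod.inv(2.3522, 48.8566, -74.0060, 40.7128)  # Paris -> NYC
print(round(haversine_m(2.3522, 48.8566, -74.0060, 40.7128)), round(geodesic_m))
```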
I get that drawing a projection is inaccurate at scale, but if I do all my calculation as meters north/south and east/west of 0,0 on the equator, won’t all my distance calculations be correct?
Like 5,000,000 m east and 0 m north is 5,000 km from the origin. I cannot see how that could ever become inaccurate as I move further away.
Where is the inaccuracy introduced? When I reproject back to coordinates on a globe?
The spatial features were just barely out when I was last doing a lot of heavy geospatial work, but even then they were very nice.
An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.
pandas is very nice and was my bread and butter for a long time, but I frequently ran into memory issues and problems at scale with pandas, which I would never hit with polars or duckdb. I'm not sure if this holds true today as I know there have been updates, but it was certainly a problem then. Using geopandas ran into the same issues.
Just using GDAL and other libraries out of the box is frankly not a great experience. If you have a QGIS (another wonderful tool) workflow, it's frustrating to be dropping into Jupyter notebooks to do translations, but that seemed to be the best option.
In general, it just feels like geospatial analysis is about 10 years behind regular data analysis. Shapefiles are common because of ESRI dominance, but frankly not a great format. PostGIS is great, geopandas is great, but there's a lot more things in the data ecosystem than just Postgres and pandas. PowerBI barely had geospatial support a couple years ago. I think PowerBI Shapemaps exclusively used TopoJSON?
All of this is to say, DuckDB geospatial is very cool and helpful.
What was shockingly useful in PowerBI compared to DuckDB?
"Hexagons were an important choice because people in a city are often in motion, and hexagons minimize the quantization error introduced when users move through a city. Hexagons also allow us to approximate radiuses easily, such as in this example using Elasticsearch."
[Edit] Maybe https://www.researchgate.net/publication/372766828_YOLOv8_fo...
This article doesn't acknowledge how niche this stuff is and it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc. I would replace a lot of my custom built mapping tools with Felt if it were possible, so I could focus on our core geospatial processes and not the code to display and play with it in the browser, which is almost as big if not bigger in terms of LOC to maintain.
As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.
Yes, there are so many great tools to handle the complexity for the capital-G Geospatial work.
I love Felt too! Sam and team have built a great platform. But lots of times a map isn't needed; an analyst just needs it as a column.
PostGIS is also excellent! But having to start up a database server to work with data doesn't lend itself to casual usage.
The beauty of DuckDB is that it's there in a moment and in reach for data generalists.
To me, it's mostly a frontend problem stopping the spread of mapping in consumer apps. Backend geo is easy tbh. There is so much good, free tooling. Mapping frontend is hell and there is no good off-the-shelf solution I've seen. Some are too low level, some too high level. I think we need a GIS-lite that is embeddable to hide the complexity and let app developers focus on their value add, and not pay the tax of having frontend developers fix endless issues with maps they don't understand.
edit: to clarify, I think there's a relationship between getting mapping valued by leadership such that the geo work can be even be done by analysts, and having more mapping tools exist in frontend apps such that those leaders see them and understand why geo matters. it needs to be more than just markers on the map, with broad exposure. hence my focus on frontend web. sorry if that felt disjointed
DuckDB is great, but the fact that it makes it easier for data generalists to make mistakes with geospatial data is mark against it, not in its favor.
https://georeferenced.wordpress.com/2014/05/22/worldmapblund... https://www.economist.com/asia/2003/05/15/correction-north-k...
This can mostly be avoided entirely with a proper spheroidal reference system, computational geometry implementation, and indexing. Most uses of geospatial analytics are not cartographic in nature. The map is at best a presentation layer, it is not the data model, and some don’t use a map at all. Forcing people to learn obscure and esoteric cartographic systems to ask simple intuitive questions about geospatial relationships is a big part of the problem. There is no reason this needs to be part of the learning curve.
I’ve run experiments on unsophisticated users a few times with respect to this. If you give them a fully spheroidal WGS84 implementation for geospatial analytics, it mostly “just works” for them anywhere on the globe and without regard for geospatial extent. Yes, the software implementation is much less trivial but it is qualitatively superior UX because “the world” kind of behaves how people intuit it should without having to know anything about projections, transforms, etc. And to be honest, even if you do know about projections and transforms, the results are still often less than optimal.
The only issue that comes up is that a lot of cartographic visualization toolkits are somewhat broken if you have global data models or a lot of complex geometry. Lots of rendering artifacts. Something else to work on I guess.
Aren't most geospatial tools just doing simple geometry? And therefore need to work on some sort of projection?
If you can do the math on the spheroidal model, ok you get better results and its easier to intuit like you said, but it's much more complicated math. Can you actually do that today with tools like QGIS and GDAL?
It is entirely possible to do this in databases. That is how it is actually done. The limitations of GEOS are not the limitations of software, it is not a particularly sophisticated implementation (even PostGIS doesn’t use it for the important parts last I checked). To some extent you are affirming that there is a lack of ambition in this part of the market in open source.
I can sort of see your point about the merits of global, spheroidal geometry, certainly from a user's perspective. But there's no getting around the fact that the geometry calculations are both slower (I'm tempted to say "inherently"…) and far more complex to implement (just look at how painful it is to write a performant, accurate r- or r*-tree for spherical coordinates) along every dimension. That's not going to change any time soon, so the projection workflow probably isn't going anywhere.
I'm actually curious, speaking for yourself, what kind of analysis you're doing where something like NAD83 or UTM does not give you enough precision? Is this actually "real world" geospatial data? If I have a soil model, I have a very localized analysis, and if I have a global climate model, we're talking kilometers for grid cells. In all these cases, the collected data has built-in geolocation error MUCH greater than most decent projections...
So, what analysis are you doing where you need centimeter precision at global scale of thousands of kilometers? Sounds really interesting. The only time I've seen this, is doing space flight simulations where the error really accumulates into the future.
An additional issue is that the spheroidal implementations have undergone very little optimization, perhaps because they are not the defaults. So when people figure out how to turn them on, performance is suddenly terrible. Now you have people that believe spheroidal implementations are terribly slow, when in reality they just used a pathologically slow implementation. Really good performance-engineered spheroidal implementations are much faster than people assume based on the performance of open source implementations.
Yes, spherical representations of polygon data exist, but the data you import has already been "split" and undoing that is often impossible, or at best non-trivial. And then rasters are fundamentally impossible to represent that way.
Analysis uses projections for that reason. Spherical approaches aren't fundamentally "better" for most use cases. They're only strictly better if everything you're working with is a point.
There's more to geospatial than point datasets.
But any type of vector data could be modeled on a sphere, right? Points, shapes, lines. And I say "better" because even the best-suited projection will have some small amount of distortion.
Either way, most things use planar geometry, so projections are necessary, and you need to have some understanding of how all that works.
No, it's not.
Good god.
QGIS is amazing. It's really great. It also came out in 2002, so I think the headline is safe.
I've been able to "CREATE EXTENSION postgis;" for more than a decade. There have been spatial extensions for PG, MySQL, Oracle, MS SQL Server, and SQLite for a long time. DuckDB doesn't make any material difference in how easy it is to install.
DuckDB, on the other hand, works with data as-is (Parquet, TSV, SQLite, Postgres... whether on disk, S3, etc.) without requiring an ETL step (though if the data isn't already in a columnar format, things are gonna be slow... but it will still work).
I work with Parquet data directly with no ETL step. I can literally drop into Jupyter or a Python REPL and duckdb.query("from '*.parquet'")
Correct me if I'm wrong, but I don't think that's possible with Postgis. (even pg_parquet requires copying? [1])
[1] https://www.crunchydata.com/blog/pg_parquet-an-extension-to-...
I think this is a fundamental problem with the SQL pattern. You can try to make things just work, but when they fail, then what?
SQL is a DSL and yes, all Domain Specific Languages will only enable what the engine parsing the DSL supports.
But all SQL databases I'm aware of let you write custom extensions, which are exactly that: they extend the base functionality of the database with new paradigms, e.g. PostGIS enabling geospatial in Postgres, or the extensions that enable fuzzy matching/searching.
And as SQL is pretty much a Turing-complete DSL, there is very little you can't do with it, even if the syntax might not agree with everyone.
```sql
SELECT * FROM './path/to/some/shapefile/dataset.shp';

COPY table TO 'some/file/path/filename.geojson' WITH (FORMAT gdal, DRIVER 'GeoJSON', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
```
This seems extremely helpful to me.
```
ogr2ogr -f "PostgreSQL" -t_srs "EPSG:2274" PG:"host=host user=user dbname=database password=trustno1 schemas=schema" shapefile.shp -nln new_name_for_table
```
I'll give DuckDB a shot and compare to postgis. Eventually, I think the issue for me would be that I use all the vector outputs via apps connecting to the same database.
For geospatial analysis, the most important thing that could happen in software would be no longer treating it, either explicitly or implicitly, as having anything to do with cartography. Many use cases are not remotely map-driven but the tools require users to force everything through the lens of map-making.
Strongly agree with your sentiment around maps: most people can’t read them, they color the entire workflow and make it more complex, and (imo) lead to a general undervaluing of the geospatial field. Getting the data into columns means it’s usable by every department.
This leads to some complex situations where some builds would contravene LGPL 2.1, e.g. static linking with a closed-source application.
If you want to import something and work with it GeoPandas exists. If you want something integrated with a SQL database, PostGIS exists.
On the application side of the spectrum, GRASS GIS, QGIS etc. say hi. They are being used in the agriculture industry and by government agencies (at least I know that's the case in Germany and Brazil).
You should convert them to Parquet; then access and analytics become cheap and fast.
This is a new paradigm for folks who aren’t in big data — the conventional approach usually involves doing a row INSERT. In big data, appending simply means adding a new file - the database engine will immediately recognize its presence. This is why “select * from ‘*.parquet’” will always operate on the latest dataset.
For example, my data is usually batched by yearwk (year + week no), so my directory structure looks like this:
```
/data/yearwk=202501/000.parquet
/data/yearwk=202502/000.parquet
```

This is also called the Hive directory structure. When I query, I just do:

```sql
select * from '/data/**/*.parquet';
```
This is a paradigm shift from standard database thinking for handling truly big data. It's append-only by file. 500GB in CSVs doesn't sound that big, though. I'm guessing when you convert it to Parquet (a one-liner in DuckDB, below) it might end up being 50GB or so.
```sql
COPY (FROM '/data/*.csv') TO 'my.parquet' (FORMAT PARQUET);
```
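On the read side, a hedged sketch of how that Hive layout pays off: with hive_partitioning enabled, directory names like yearwk=202501 become a queryable column, so a filter on it skips whole directories (paths and column names are placeholders):

```python
# Sketch: read only the partitions you need. hive_partitioning exposes 'yearwk'
# as a column derived from the directory names, so this filter prunes whole
# directories instead of scanning every file. Partition values may come back as
# strings depending on settings, hence the quoted, zero-padded bounds.
import duckdb

duckdb.query("""
    SELECT count(*)
    FROM read_parquet('/data/**/*.parquet', hive_partitioning = true)
    WHERE yearwk BETWEEN '202501' AND '202504'
""")
```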
The partitioning of Parquet files might be an issue, as not all data is neatly partitioned by date. We have trades with different execution dates, clearance dates, and other date values that we need to query on.
If date partitioning doesn’t work, just find another chunking key. The key is to get it into parquet format. CSV is just hugely inefficient.
Or spin up a larger compute instance with more memory. I have 256 GB on mine.
I tried running an Apache Spark job (an 8-machine cluster) on a data lake of 300 GB of TSVs once. This was a distributed cluster. There was one join in it. It timed out after 8 hours. I realized why: Spark had to do many full table scans of the TSVs, and it was just so inefficient. CSV formats are OK for straight-up reads, but any time you have to do analytics operations like aggregating or joining them at scale, you're in for a world of pain.
DuckDB has better CSV handling than Spark but a large dataset in a poor format will stymie any engine.
The problem with disk-based partitioning is that keys are difficult to manage properly.
Btw I'm not suggesting to use Spark. I'm saying that even Spark didn't work on large TSV datasets (it only takes a JOIN or GROUP BY to kill the query performance). The CSV data storage format is simply the wrong one for analytics.
Partitioning is irreversible, but coming up with a thoughtful scheme isn't that hard. You just need to hash something. Even something as simple as an FNV hash on some meaningful field is sufficient. In one of my datasets, I chunk it by week, then by FNV hash modulo 50 chunks, so it looks like this:
```
/yearwk=202501/chunk=24/000.parquet
```
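A hedged sketch of producing that kind of layout with DuckDB's PARTITION_BY (column names are placeholders, and DuckDB's built-in hash() is not FNV, but any stable hash works for chunking):

```python
# Sketch: write Hive-partitioned Parquet keyed by (yearwk, chunk). The chunk is
# just a stable hash of some id modulo 50; DuckDB's hash() is not FNV, but any
# deterministic hash does the job for chunking. Column names are placeholders.
import duckdb

duckdb.query("""
    COPY (
        SELECT *,
               strftime(event_ts, '%Y%W') AS yearwk,  -- placeholder week key
               hash(some_id) % 50 AS chunk            -- placeholder id column
        FROM '/data/*.csv'
    ) TO '/data_parquet' (FORMAT PARQUET, PARTITION_BY (yearwk, chunk))
""")
```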
Ask an LLM to suggest a partitioning scheme, or think of one yourself.
CSV is the mistake. The move here is to get out of CSV. Partitioning is secondary -- partitioning here is only used for chunking the Parquet, nothing else. You are not locked into anything.
I have to think about partitioning; Spark/Athena both had issues with partitioning by received date. They were scanning way too much data.
You need to convert it into Parquet or some columnar format that lets engines do predicate pushdowns and fast scans. Each parquet file stores statistics about the data it contains so engines can quickly decide if it’s worth reading the file or skipping it altogether.
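A hedged way to sanity-check that after converting: EXPLAIN the filtered query and look at the Parquet scan node (exact plan output varies by version, but the filter should appear inside the scan rather than as a separate pass over everything). Paths and columns below are placeholders:

```python
# Sketch: EXPLAIN shows whether the predicate is applied inside the Parquet
# scan (pushdown + row-group skipping via file statistics) or as a separate
# filter over a full scan. Path and column names are placeholders.
import duckdb

duckdb.query("""
    EXPLAIN
    SELECT count(*)
    FROM '/data/**/*.parquet'
    WHERE trade_date >= DATE '2025-01-01'
""").show()
```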
What DuckDB brings to the table is cloud-native formats. This is standard geospatial functionality attached to an object store instead of a disk. As such, it doesn't require running a database process - data is always "at rest" and available over HTTP. I'm not downplaying the accomplishment, it's really convenient. But know that this is a repackaging of existing tech to work efficiently within a cloud IO environment. If anything, the major innovation of DuckDB is in data management not geospatial per se.
DuckDB handles floating point numbers too - is DuckDB the most important thing in floating point data? Of course not, the data types and operators haven't changed. The underlying compute environment has. That's where the innovation is. I'd simply appreciate if technical writers took two seconds to make this vital distinction - being precise isn't hard and I don't know what we gain by publishing intentionally muddled technical articles.
I work with polars, but I haven't delved too deep into the performance characteristics versus postgres or anything like that.
Its API is not very well covered; some parts are still missing, but that must just be a matter of time.
Where it shines is when you need to do an O(n²) or O(n·m) type of calculation. Then those hundreds of free CPU cores really come in handy! And the end result is pretty often a dollar for CPU-days' worth of computation.
Examples of O(n·m) calculations are things like finding the closest road segment inside a tile (or, more likely, a tile and its surrounding tiles) for each point in a list.
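To make the shape of that concrete, a hedged sketch of the brute-force O(n·m) pattern (tables and columns are made up; DuckDB parallelizes the cross join across cores):

```python
# Sketch of the O(n*m) pattern: for each point, scan all candidate road
# segments and keep the closest one. arg_min picks the road_id with the
# smallest distance per point. Tables and columns are placeholders.
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("""
    SELECT p.point_id,
           arg_min(r.road_id, ST_Distance(p.geom, r.geom)) AS nearest_road,
           min(ST_Distance(p.geom, r.geom))                AS dist
    FROM points p
    CROSS JOIN roads r  -- n * m distance computations, parallelized by DuckDB
    GROUP BY p.point_id
""")
```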
https://dbeaver.com/docs/dbeaver/Working-with-Spatial-GIS-da...
The big change, in my view, over the past decade in GIS software, is in compute and storage efficiency across the typical stack. DuckDB has become a part of this, but h/t to the advances from shapely, geopandas, geoparquet, and GDAL. There's a lot of overlap in that venn diagram, and credit should be spread around. QGIS is great, too, though I feel there is market opportunity to apply 90/10 to its massive feature set and move it to the web.