
Why DuckDB is my first choice for data processing

https://www.robinlinacre.com/recommend_duckdb/
78•tosh•7h ago

Comments

DangitBobby•3h ago
Being able to use SQL on CSV and json/jsonl files is pretty sweet. Of course it does much more than that, but that's what I do most often with it. Love duckdb.
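A minimal sketch of what this looks like (file and column names here are made up, not from the thread):

```sql
-- Query a CSV file directly; DuckDB infers column names and types.
SELECT status, count(*) AS n
FROM 'events.csv'
GROUP BY status;

-- Newline-delimited JSON works the same way.
SELECT *
FROM read_json_auto('events.jsonl')
LIMIT 10;
```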
samuell•2h ago
Indeed! I generally like awk a lot for simpler CSV/TSV processing, but when it comes to cases where you need things like combining/joining multiple CSV files or aggregating over certain columns, SQL really shines IME.
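For instance, a join-plus-aggregate across two CSV files, exactly the case where SQL beats awk (file and column names hypothetical):

```sql
-- Join two CSV files and aggregate per customer in one statement.
SELECT c.name, sum(o.amount) AS total_spend
FROM 'orders.csv' AS o
JOIN 'customers.csv' AS c ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total_spend DESC;
```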
oulu2006•3h ago
That's really interesting, I love the idea of being able to use columnar support directly within PostgreSQL.

I was thinking of using Citus for this, but possibly using duckdb is a better way to do it. Citus comes with a lot more out of the box, but duckdb could be a good stepping stone.

biophysboy•1h ago
It's a really handy tool. I've queried basically everything you can w/ duckdb - csv, json, s3 buckets, MS SQL servers, excel sheets, pandas dataframes, etc - and have had very few issues.
clumsysmurf•2h ago
DuckDB has experimental builds for Android ... I'm wondering how much work it would take to implement a Java API for it similar to sqlite (Cursor, etc).
jtbaker•38m ago
Something different than https://duckdb.org/docs/stable/clients/java?
smithclay•1h ago
Agree with the author, will add: duckdb is an extremely compelling choice if you’re a developer and want to embed analytics in your app (which can also run in a web browser with wasm!)

Think this opens up a lot of interesting possibilities, such as more powerful analytics notebooks like marimo (https://marimo.io/) … and that's just one example of many.

canadiantim•1h ago
The wasm is pretty heavy data-wise tho; I'm hoping it'll eventually be lighter for easier loading on not-so-good devices.
tjchear•1h ago
I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?
biophysboy•1h ago
Zonemaps are created for columns automatically. I process somewhat large tables w/ duckdb regularly (100M rows) and never have any problems.
riku_iki•49m ago
that's true for duckdb native tables, but the question was about json.
akhundelar•1h ago
Not a duckdb user, but I use polars a lot (mentioned in the article).

Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.

mpalmer•1h ago
I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.
simlevesque•1h ago
But once your json or csv is imported into a columnar format, if you have say 10 columns, each column is stored separately on your disk instead of all together. So a scan for one column only needs to read about a tenth of the disk space used for the data. Obviously this depends on the columns' content.
gdulli•37m ago
But you can have a surprisingly large amount of data before the inefficiency you're talking about becomes untenable.
ayhanfuat•41m ago
If you are going to query it frequently then json/csv might become an issue. I think the reason it doesn't become a problem for duckdb/polars users is that we generally convert them to parquet after first read.
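The one-time conversion described here is a single statement (paths hypothetical):

```sql
-- Read the CSV once and persist it as Parquet for later queries.
COPY (SELECT * FROM 'big.csv') TO 'big.parquet' (FORMAT parquet);

-- Subsequent reads hit the columnar file instead of re-parsing the CSV.
SELECT count(*) FROM 'big.parquet' WHERE amount > 100;
```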
RobinL•38m ago
It is true that for json and csv you need a full scan but there are several mitigations.

The first is simply that it's fast - for example, DuckDB has one of the best csv readers around, and it's parallelised.

Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.

In addition, engines like DuckDB have predicate pushdown so if your data is stored in parquet format, then you do not need to scan every row because the parquet files themselves hold metadata about the values contained within the file.

Finally, when data is stored in a columnar format like parquet, a query only needs to scan the data in the columns it references, rather than processing the whole row even though you may only be interested in one or two columns.
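To illustrate the pushdown point, a sketch against a hypothetical Parquet file: min/max statistics stored per row group let DuckDB skip chunks whose ranges cannot match the filter, and the columnar layout means only the referenced column is read at all.

```sql
-- Only row groups whose ts min/max range overlaps 2024 are read,
-- and only the ts column is touched.
SELECT count(*)
FROM 'events.parquet'
WHERE ts >= DATE '2024-01-01' AND ts < DATE '2025-01-01';
```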

biophysboy•1h ago
I think my favorite part of duckdb is its flexibility. It's such a handy little Swiss Army knife for doing analytical processing in scientific environments (messy data w/ many formats).
s-a-p•1h ago
"making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data" > I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduce DuckLake.

Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up s3 + Iceberg + duckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.

biophysboy•1h ago
DuckLake is pretty new, so I guess it would depend on if you need a more mature, fully-featured app.
pattar•35m ago
I went to a talk by the Motherduck team about why they built DuckLake instead of leaning in further on Iceberg. The key takeaway is that instead of storing all the table metadata inside files on s3 and dealing with the latency and file IO, they store all of that info inside a duckdb table. Seems like a good idea and worked smoothly when I tried it; however, it is not quite in a stable production state, as it is still <1.0. They have a nice talk about it on youtube: https://youtu.be/hrTjvvwhHEQ?si=WaT-rclQHBxnc9qV
nchagnet•3m ago
We're using ducklake with data storage on Google cloud storage and the catalog inside a postgres database and it's a breeze! It may not be the most mature product, but it's definitely a good setup for small to medium applications which still require a data lake.
noo_u•1h ago
I'd say the author's thoughts are valid for basic data processing. Outside of that, most of the claims in this article, such as:

"We're moving towards a simpler world where most tabular data can be processed on a single large machine and the era of clusters is coming to an end for all but the largest datasets."

become very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.

The author also has another article where they claim that:

"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)

This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.

hnthrowaway0315•58m ago
SQL is popular because everyone can learn and start using it after a while. I agree that Python sometimes is a better tool, but I don't see SQL going away anytime soon.

From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.

RobinL•26m ago
Author here. Re: 'SQL should be the first option considered', there are certainly advantages to other dataframe APIs like pandas or polars, and arguably any one of them is better in the moment than SQL. At the moment Polars is ascendant and it's a high quality API.

But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.

I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.

I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.

mrtimo•56m ago
What I love about duckdb:

-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).

-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.

-- if the files don't have the same schema, union_by_name is amazing.

-- The .csv parser is amazing. Auto assigns types well.

-- It's small! The Web Assembly version is 2mb! The CLI is 16mb.

-- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like Typescript transpiles to Javascript.
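The glob and union_by_name points above can be sketched as follows (the 'tsa20*.csv' glob is from the comment; the option spelling follows DuckDB's CSV reader):

```sql
-- Hundreds of files read as if they were one table.
SELECT * FROM 'tsa20*.csv';

-- If the files' schemas differ, align columns by name rather than position.
SELECT *
FROM read_csv_auto('tsa20*.csv', union_by_name = true);
```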

lvl155•38m ago
I use DuckDB for nearly everything data related because it's simply the most scalable format/tool/lib from a handful of rows to billions. It can handle it and I don't need to do anything special on top. I've been a big proponent of doing everything in SQL for a while now. It's the most recyclable AND least error prone way of working with data.
majkinetor•9m ago
Anybody with experience using duckdb to quickly select a page of filtered transactions from a single table with a couple of billion records and, let's say, 30 columns, where each can be filtered using a simple WHERE clause? Let's say 10 years of payment order data. I am wondering since this is not an analytical scenario.

Doing that in postgres takes some time, and even a simple count(*) takes a lot of time (with all columns indexed).
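A sketch of the kind of query being asked about (table and column names invented for illustration): DuckDB has no secondary indexes to lean on here, but its parallel scan plus per-column zonemaps often keep such filters tolerable, especially once the data sits in native or Parquet storage sorted on a common filter column.

```sql
-- Fetch one page of filtered payment orders.
SELECT *
FROM payment_orders
WHERE order_date >= DATE '2023-01-01'
  AND status = 'SETTLED'
  AND amount > 1000
ORDER BY order_date, order_id
LIMIT 50;
```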

film42•7m ago
Just 10 minutes ago I was working with a very large semi-malformed excel file generated by a mainframe. DuckDB was able to load it with all_varchar (just keep everything a string) in under a second.

I'm still waiting for Excel to load the file.
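The load described above was presumably something like the following; read_xlsx and its all_varchar option come from DuckDB's excel extension, though the exact spelling depends on the DuckDB version (file name hypothetical):

```sql
-- Load a messy spreadsheet keeping every column as VARCHAR,
-- deferring type cleanup to later queries.
INSTALL excel;
LOAD excel;
SELECT * FROM read_xlsx('report.xlsx', all_varchar = true);
```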
