
Use DuckDB-WASM to query TB of data in browser

https://lil.law.harvard.edu/blog/2025/10/24/rethinking-data-discovery-for-libraries-and-digital-humanities/
237•mlissner•3mo ago

Comments

mlissner•3mo ago
OK, this is really neat: - S3 is really cheap static storage for files. - DuckDB is a database that uses S3 for its storage. - WASM lets you run binary (non-JS) code in your browser. - DuckDB-Wasm allows you to run a database in your browser.

Put all of that together, and you get a website that queries S3 with no backend at all. Amazing.

timeflex•3mo ago
S3 might be relatively cheap for storing files, but with bandwidth you could easily be paying $230/mo. If you make it public-facing and try to use their cloud reporting, metrics, etc. to prevent people from running up your bandwidth, your "really cheap" static hosting could easily cost you more than $500/mo.
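
As a sanity check, the $230/mo figure is roughly consistent with AWS's published data-transfer-out rate. The ~$0.09/GB rate below is an assumption (the first pricing tier; tiers and current prices vary), so treat this as back-of-the-envelope arithmetic, not a quote:

```python
# Back-of-the-envelope S3 egress cost. EGRESS_PER_GB is an assumed
# rate (~$0.09/GB, first tier); check the current AWS pricing page.
EGRESS_PER_GB = 0.09

def monthly_egress_cost(tb_out_per_month: float) -> float:
    """Dollars per month for a given egress volume in TB."""
    return tb_out_per_month * 1024 * EGRESS_PER_GB

print(round(monthly_egress_cost(2.5)))  # ~2.5 TB/mo lands near the $230 figure
```

At that rate, roughly 2.5 TB of egress per month is where the ~$230/mo number comes out.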
theultdev•3mo ago
R2 is S3 compatible with no egress fees.

Cloudflare actually has built in iceberg support for R2 buckets. It's quite nice.

Combine that with their pipelines and it's a simple HTTP request to ingest; then just point DuckDB at the iceberg-enabled R2 bucket to analyze.

greatNespresso•3mo ago
Was about to jump in to say the same thing. R2 is a much cheaper alternative to S3 that just works. I have used it with DuckDB and it works smoothly.
apwheele•3mo ago
For a demo of this (though I'm not sure it works with iceberg in DuckDB-WASM): https://andrewpwheeler.com/2025/06/29/using-duckdb-wasm-clou...
8organicbits•3mo ago
> R2 is S3 compatible with no egress fees.

There's no egress data transfer fees, but you still pay for the GET request operations. Lots of little range requests can add up quick.
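
To put numbers on that: the per-million rates below are assumptions from memory of the public pricing pages (S3 GET ≈ $0.40/million, R2 Class B ≈ $0.36/million) — check current pricing before relying on them:

```python
# Request-pricing sketch: egress may be free, but every range request
# is a billed read operation. Rates are assumptions, not quotes:
#   S3 GET      ~ $0.40 per million requests
#   R2 Class B  ~ $0.36 per million operations
def read_op_cost(requests: int, dollars_per_million: float) -> float:
    """Dollars for a given number of read requests at a per-million rate."""
    return requests / 1_000_000 * dollars_per_million

# 50M small range requests in a month is cheap, but not zero:
print(round(read_op_cost(50_000_000, 0.36), 2))
```

So the request charges stay small unless a chatty client issues very many tiny ranges — which is exactly the access pattern DuckDB-WASM can produce.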

zenmac•3mo ago
Can't believe that is what the industry has come down to. Kind of like clipping coupons to get the best deal according to the different pricing overlords.

It is time like this that makes self-hosting a lot more attractive.

theultdev•3mo ago
Luckily it's just static files. You can use whatever host you want.
7952•3mo ago
I think this approach makes sense for services with a small number of users relative to the data they are searching. That just isn't a good fit for a lot of hosted services. Think how much those TBs of data would cost on Algolia or similar services.

You have to store the data somehow anyway, and you have to retrieve some of it to service a query. If egress costs too much, you could always change later and move the browser code to a server. It would also presumably be possible to quantify the trade-off between processing the data client-side and on the server.

simonw•3mo ago
Stick it behind Cloudflare and it should be effectively free.
bigiain•3mo ago
Until it isn't.
rubenvanwyk•3mo ago
Or use R2 instead. It’s even easier.
thadt•3mo ago
S3 is doing quite a lot of sophisticated lifting to qualify as no backend at all.

But yeah - this is pretty neat. It seems like the future for static datasets should be something like this: just data, with some well-chosen indices.

theultdev•3mo ago
Still qualifies imo. Everything is static and on a CDN.

Lack of server/dynamic code qualifies as no backend.

simonw•3mo ago
I believe all S3 has to do here is respond to HTTP Range queries, which are supported by almost every static server out there - Apache, Nginx etc should all support the same trick.
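
That claim is easy to demo with nothing but the Python standard library: a toy static server that honors simple `bytes=start-end` Range headers (a sketch only — single ranges, no suffix ranges, no validation, not production code), plus a client fetching 100 bytes out of the middle of a 4 KiB "file":

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = bytes(range(256)) * 16  # 4 KiB stand-in for a parquet file

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = self.headers.get("Range")
        if rng and rng.startswith("bytes="):
            # Handle only the simple "bytes=start-end" form.
            start_s, _, end_s = rng[len("bytes="):].partition("-")
            start = int(start_s)
            end = int(end_s) if end_s else len(DATA) - 1
            chunk = DATA[start:end + 1]
            self.send_response(206)  # Partial Content
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(DATA)}")
            self.send_header("Content-Length", str(len(chunk)))
            self.end_headers()
            self.wfile.write(chunk)
        else:
            self.send_response(200)
            self.send_header("Content-Length", str(len(DATA)))
            self.end_headers()
            self.wfile.write(DATA)

    def log_message(self, *args):  # keep the example quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_address[1]}/data.parquet",
    headers={"Range": "bytes=100-199"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    status = resp.status

server.shutdown()
print(status, len(body), body == DATA[100:200])  # 206 100 True
```

A reader like DuckDB-WASM only needs the server side of this handshake: a 206 response with the requested byte window.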
thadt•3mo ago
100%. I’m with y’all - this is what I would also call a “no-backend” solution and I’m all in on this type of approach for static data sets - this is the future, and could be served with a very simple web server.

I’m just bemused that we all refer to one of the larger, more sophisticated storage systems on the planet, composed of dozens of subsystems and thousands of servers, as “no backend at all.” Kind of a “draw the rest of the owl”.

codedokode•3mo ago
Can you replace S3 with a directory and nginx and save a lot of money?
dtech•3mo ago
Yes, IIRC it's not S3-specific, just URLs.
mpweiher•3mo ago
Yes. Especially if you use Storage Combinators.

They let you easily abstract over storage.

https://2019.splashcon.org/details/splash-2019-Onward-papers...

amazingamazing•3mo ago
Neat. Can you use DuckDB backed by another store, like RocksDB or something? Also, I wonder how one stops DDoS. Put the whole thing behind Cloudflare?
wewewedxfgdf•3mo ago
I tried DuckDB - liked it a lot - was ready to go further.

But I found it to be a real hassle to help it understand the right number of threads and the amount of memory to use.

This led to lots of crashes. If you look at the project's GitHub issues you will see many OOM (out of memory) errors.

And then there was some index bug that crashed, seemingly unrelated to memory.

Life is too short for crashy database software so I reluctantly dropped it. I was disappointed because it was exactly what I was looking for.
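
For what it's worth, the knobs described above do exist as single settings in current DuckDB releases (setting names per the DuckDB configuration docs; a sketch of sensible caps, not a fix for the crashes described):

```sql
SET memory_limit = '4GB';                  -- hard cap on DuckDB's RAM use
SET threads = 4;                           -- limit query parallelism
SET temp_directory = '/tmp/duckdb_spill';  -- where larger-than-memory operators spill
```
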

lalitmaganti•3mo ago
+1, this was my experience trying it out as well. I find that for getting started and for simple use cases it works amazingly well. But I have quite a lot of concerns about how it scales to more complex and esoteric workloads.

Non-deterministic OOMs especially are among the worst failure modes for the sort of tools I'd want to use DuckDB in, and as you say, I found them to be more common than I would like.

tuhgdetzhh•3mo ago
I can recommend earlyoom (https://github.com/rfjakob/earlyoom). Instead of freezing or crashing your system, this tool kills the memory-eating process just in time (in this case DuckDB). This allows you to repeat with smaller chunks of the dataset until it fits into your memory.
wewewedxfgdf•3mo ago
Yeah memory and thread management is the job of the application, not me.
QuantumNomad_•3mo ago
When there is a specific program I want to run with a limit on how much memory it is allowed to allocate, I have found systemd-run to work well.

It uses cgroups to enforce resource limits.

For example, there’s a program I wrote myself which I run on one of my Raspberry Pis. I had a problem where my program would on rare occasions use up too much memory and I wouldn’t even be able to ssh into the Raspberry Pi.

I run it like this:

  systemd-run --scope -p MemoryMax=5G --user env FOOBAR=baz ./target/release/myprog
The only difficulty I had was finding the right name to use in the MemoryMax=… part, because they’ve changed the name around between versions, so different Linux systems may or may not use the same name for the limit.

In order to figure out if I had the right name for it, I tested different names for it with a super small limit that I knew was less than the program needs even in normal conditions. And when I found the right name, the program would as expected be killed right off the bat and so then I could set the limit to 5G (five gigabytes) and be confident that if it exceeds that then it will be killed instead of making my Raspberry Pi impossible to ssh into again.

thenaturalist•3mo ago
This looks amazing!

Have you used this in conjunction with DuckDB?

tuhgdetzhh•3mo ago
Yes, it works just fine.
mritchie712•3mo ago
What did you use instead? If you hit OOM with the dataset in DuckDB, I'd think you'd hit OOM with most other things on the same machine.
wewewedxfgdf•3mo ago
The software should manage its own memory, not require the developer to set specific memory thresholds. Sure, it's a good thing to be able to say "use no more than X RAM".
thenaturalist•3mo ago
How long ago was this, or can you share more context about data and mem size you experienced this with?

DuckDB introduced spilling to disk and some other tweaks a good year ago now: https://duckdb.org/2024/07/09/memory-management

wewewedxfgdf•3mo ago
3 days ago.

The final straw was an index which generated fine on macOS and failed on Linux - exact same code.

Machine had plenty of RAM.

The thing is, it is really the responsibility of the application to regulate its behavior based on available memory. Crashing out just should not be an option but that's the way DuckDB is built.

alex-korr•3mo ago
I had the same experience - everything runs great on an AWS Linux EC2 instance with 32GB of memory; the same workload in a Docker container on ECS with 32GB allocated gets an OOM. But for smaller workloads, DuckDB is fantastic... however, there's a certain point where Spark or Snowflake starts to make more sense.
jdnier•3mo ago
Yesterday there was a somewhat similar DuckDB post, "Frozen DuckLakes for Multi-User, Serverless Data Access". https://news.ycombinator.com/item?id=45702831
85392_school•3mo ago
This also reminded me of an approach using SQLite: https://news.ycombinator.com/item?id=45748186
pacbard•3mo ago
I set up something similar at work. But it was before the DuckLake format was available, so it just uses manually generated Parquet files saved to a bucket and a light DuckDB catalog that uses views to expose the parquet files. This lets us update the Parquet files using our ETL process and just refresh the catalog when there is a schema change.

We didn't find the frozen DuckLake setup useful for our use case. Mostly because the frozen catalog kind of doesn't make sense with the DuckLake philosophy and the cost-benefit wasn't there over a regular duckdb catalog. It also made making updates cumbersome because you need to pull the DuckLake catalog, commit the changes, and re-upload the catalog (instead of just directly updating the Parquet files). I get that we are missing the time travel part of the DuckLake, but that's not critical for us and if it becomes important, we would just roll out a PostgreSQL database to manage the catalog.

SteveMoody73•3mo ago
My initial thought is: why query 1 TB of data in a browser? Maybe I'm the wrong target audience for this, but it seems to be pushing the idea that everything has to be in a browser rather than using appropriate tools.
cyanydeez•3mo ago
Browsers are now the write-once works everywhere target. Where java failed, many hope browsers succeed. WASM is definitely a key to that, particularly because it can be output by tools like rust, so they can also be the appropriate tools.
majormajor•3mo ago
Why pay for RAM for servers when you can let your users deal with it? ;)

(Does not seem like a realistic scenario to me for many uses, for RAM among other resource reasons.)

some_guy_nobel•3mo ago
The one word answer is cost.

But, if you'd like to instead read the article, you'll see that they qualify the reasoning in the first section of the article, titled, "Rethinking the Old Trade-Off: Cost, Complexity, and Access".

simonw•3mo ago
What appropriate tool would you use for this instead?
shawn-butler•3mo ago
I doubt they are querying 1 TB of data in the browser. DuckDB-WASM issues HTTP range requests on behalf of the client to request only the bytes required - especially handy with parquet files (a columnar format), which let you exclude columns you don't even need.

But the article is a little light on technical details. In some cases it might make sense to bring the entire file client-side.
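
The effect of that column pruning is easy to sketch. The function and numbers below are purely illustrative (equal-width columns assumed, which real parquet files won't have), but they show why touching a 1 TB file rarely means fetching 1 TB:

```python
# Rough model of bytes actually fetched from a columnar file:
# a query touching k of n (equal-width) columns over some fraction
# of row groups only pulls roughly that share of the file.
def bytes_fetched(file_bytes: int, cols_used: int, cols_total: int,
                  row_fraction: float = 1.0) -> float:
    return file_bytes * (cols_used / cols_total) * row_fraction

TB = 1024**4
# 2 of 40 columns, 10% of row groups survive predicate pushdown:
gb = bytes_fetched(TB, 2, 40, 0.10) / 1024**3
print(round(gb, 1))  # a 1 TB file shrinks to a few GB over the wire
```

Under those (made-up) assumptions the browser pulls about 5 GB, not 1 TB — still a lot, but a different order of magnitude.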

fragmede•3mo ago
For small databases, SQLite is handy, as there are multiple ways to parse the format for clients.
r3tr0•3mo ago
It's one of the best tricks in the book.

We have been doing it for quite some time in our product to bring real-time system observability with eBPF to the browser, and we have even found other techniques to really max it out beyond what you get off the shelf.

https://yeet.cx

mrbluecoat•3mo ago
That's pretty cool. Any technical blog posts?
r3tr0•3mo ago
we got a couple blog posts

https://yeet.cx/blog

leetrout•3mo ago
I built something on top of DuckDB last year but it never got deployed. They wanted to trust Postgres.

I didn't use the in browser WASM but I did expose an api endpoint that passed data exploration queries directly to the backend like a knock off of what new relic does. I also use that same endpoint for all the graphs and metrics in the UI.

DuckDB is phenomenal tech and I love to use it with data ponds instead of data lakes although it is very capable of large sets as well.

whalesalad•3mo ago
Cool thing about DuckDB is it can be embedded. We have a data pipeline that produces a DuckDB file and puts it on S3. The app periodically checks that asset's ETag and pulls it down when it changes. Most of our DB interactions use PSQL, but we have one module that leverages DuckDB and this file for reads. So it's definitely not all-or-nothing.
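
That ETag-polling pattern can be sketched with only the Python standard library against a toy in-process server. All names and the `"v1"` tag here are hypothetical; the real mechanism is that S3/R2 return an ETag header and honor If-None-Match with a 304 when the object hasn't changed:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical asset + ETag, standing in for a .duckdb file on S3/R2.
ASSET = b"pretend this is a .duckdb file"
ETAG = '"v1"'

class EtagHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)  # Not Modified: client keeps its copy
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Content-Length", str(len(ASSET)))
        self.end_headers()
        self.wfile.write(ASSET)

    def log_message(self, *args):  # keep the example quiet
        pass

def fetch_if_changed(url, cached_etag=None):
    """Return (etag, body); body is None when the cached copy is still fresh."""
    req = urllib.request.Request(url)
    if cached_etag:
        req.add_header("If-None-Match", cached_etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.headers.get("ETag"), resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return cached_etag, None
        raise

server = HTTPServer(("127.0.0.1", 0), EtagHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/analytics.duckdb"

etag, body = fetch_if_changed(url)          # first poll: full download
etag2, body2 = fetch_if_changed(url, etag)  # second poll: 304, no body
server.shutdown()
print(body == ASSET, body2 is None)  # True True
```

Only the first poll pays for the download; every poll after that is a near-free 304 until the pipeline publishes a new file.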
zenmac•3mo ago
Are you using pg_duckdb to embed it inside Postgres and access it via psql or other pg clients?
victor106•3mo ago
> data ponds instead of data lakes

What are data ponds? Never heard the term before

leetrout•3mo ago
Haha, my term. Somewhere between a data lake and warehouse - still unstructured but not _everything_ in one place. For instance, if I have a multi-tenant app I might choose to have a duckdb setup for each customer with pre-filtered data living alongside some global unstructured data.

Maybe there's already a term that covers this but I like the imagery of the metaphor... "smaller, multiple data but same idea as the big one".

victor106•3mo ago
Got it, Thanks for the explanation.
didip•3mo ago
How… does it not blow up the browser's memory?
Copenjin•3mo ago
The UI element is a scrollable table with a fixed-size viewport window; memory shouldn't be a problem since they just have to retrieve and cache a reasonable area around that window. Old data can just be discarded.
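
That windowing idea fits in a few lines. The class and names below are hypothetical (`fetch_page` stands in for whatever LIMIT/OFFSET-style query the app issues); the point is the bounded LRU page cache:

```python
from collections import OrderedDict

class WindowedRows:
    """Sketch of the viewport pattern: fetch rows in fixed-size pages,
    keep only the most recently used pages, discard the rest."""
    def __init__(self, fetch_page, page_size=100, max_pages=8):
        self.fetch_page = fetch_page      # e.g. wraps a LIMIT/OFFSET query
        self.page_size = page_size
        self.max_pages = max_pages
        self.cache = OrderedDict()        # page_no -> list of rows

    def row(self, i):
        page_no = i // self.page_size
        if page_no in self.cache:
            self.cache.move_to_end(page_no)       # mark as recently used
        else:
            self.cache[page_no] = self.fetch_page(page_no, self.page_size)
            if len(self.cache) > self.max_pages:  # evict the oldest page
                self.cache.popitem(last=False)
        return self.cache[page_no][i % self.page_size]

# Usage with a fake "query" so the sketch runs standalone:
fetches = []
def fake_fetch(page_no, size):
    fetches.append(page_no)
    return list(range(page_no * size, (page_no + 1) * size))

rows = WindowedRows(fake_fetch, page_size=100, max_pages=2)
assert rows.row(0) == 0 and rows.row(150) == 150
rows.row(5)    # page 0 still cached: no refetch
rows.row(900)  # third page: evicts the least-recently-used page 1
print(fetches, len(rows.cache))  # [0, 1, 9] 2
```

Memory stays bounded at `max_pages * page_size` rows no matter how far the user scrolls.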
barrenko•3mo ago
Where do I learn how to set up this sort of stuff? Trial and error? I kinda never need it for personal projects (so far), which always leads me to forget this stuff in between jobs kinda quickly. Is there a decent book?
vikramkr•3mo ago
If you want to learn it the best way is probably to come up with a personal project idea that requires it specifically? Idk how much you'd get out of a book but you could always do a side project with the specific goal of doing it just to learn a particular stack or whatever
dtech•3mo ago
My company tried DuckDB-WASM + parquet + S3 a few months ago but we ended up stripping it all out and replacing it with a boring REST API.

On paper it seemed like a great fit, but it turned out the WASM build doesn't have feature-parity with the "normal" variant, so things that caused us to pick it like support for parquet compression and lazy loading were not supported. So it ended up not having great performance while introducing a lot of complexity, and also was terrible for first page load time due to needing the large WASM blob. Build pipeline complexity was also inherently higher due to the dependency and data packaging needed.

Just something to be aware of if you're thinking of using it. Our conclusion was that it wasn't worth it for most use cases, which is a shame because it seems like such a cool tech.

mentalgear•3mo ago
> WASM build doesn't have feature-parity with the "normal" variant

It's a good point, but the wasm docs state that feature-parity isn't there - yet. It could certainly be more detailed, but it seems strange that your company would do all this work without first checking the feature-coverage / specs.

> WebAssembly is basically an additional platform, and there might be platform-specific limitations that make some extensions not able to match their native capabilities or to perform them in a different way.

https://duckdb.org/docs/stable/clients/wasm/extensions

dtech•3mo ago
Note that the docs specifically mention parquet as supported, but we found out the hard way that some specific features turned out not to be supported with WASM + parquet. I took a quick glance at the docs and could not find references to that, so I'm not surprised it was missed.

It was a project that exploited a new opportunity, so time-to-market was the most important thing. I'm not surprised these things were missed, and replacing the data loading mechanism was maybe 1 week of work for 1 person, so it wasn't that impactful a change later.

mentalgear•3mo ago
Fair point, thanks for sharing your experiences! You might want to edit the duckdb-wasm docs in that regard to alert others/the team to this constraint.
ludicrousdispla•3mo ago
DuckDB-WASM supports parquet file decompression though, so if you have a backend process generating them it's a non-issue.

How large was your WASM build? I'm using the standard duckdb-wasm, along with JS functions to form the SQL queries, and not seeing onerous load times.

ngc6677•3mo ago
A similar procedure is used on joblist.today (https://github.com/joblisttoday) to fetch hiring companies and their jobs, store them in SQLite and DuckDB, and retrieve them on the client side with their WASM modules. The databases are generated with a daily GitHub workflow and hosted as artifacts on a GitHub page.
bzmrgonz•3mo ago
This is brilliant, guys - omg, this is brilliant. If you think about it, freely available data always suffers from this burden... "But we don't make money, all this stuff is public data by law, and the government doesn't give us a budget." This solves that - the "can't afford it" plight of public agencies.
