Would be interesting to see if it would be possible to handle such data on one node, since the servers they are using are quite beefy.
The data is in a publicly accessible bucket, but the requester is responsible for any egress fees...
Do you know if anyone attempted to run this on the least amount of hardware possible with reasonable processing times?
Can someone explain why you would use websockets in an application where neither end is a browser? Why not just use regular sockets and cut the overhead of the http layer? Is there a real benefit I’m missing?
https://en.wikipedia.org/wiki/Constrained_Application_Protoc...
It is UDP-based but adds handshakes and retransmissions. But I am guessing for your benchmark transmission overhead isn't a major concern.
Websockets are not that bad; only the initial connection is HTTP. As long as you don't create a ton of connections all the time, it shouldn't be much slower than a plain TCP socket (a purely theoretical assumption on my part; I never tested it).
There's a whole skit in the vein of "What have the Romans ever done for us?" about ZeroMQ[1] which has probably been lost to the search index by now.
As someone who has held a socket wrench before and fought tcp_cork and DSACK, I'd say WebSockets aren't a bad abstraction to sit on top of, especially if you're intending to throw TLS in there anyway.
Low-level sockets are like assembly: you can use them, but they're a whole box of complexity (though you might go completely raw sometimes, like the tickle-ACK in the ctdb[2] implementation).
There isn't much overhead here other than connection setup. For HTTP/1 the connection is just "upgraded" to WebSockets. For HTTP/2 I think the HTTP layer still lives on a bit so that you can use connection multiplexing (which may be overhead if you have no use for it here), but that is still a very thin layer.
So I think the question isn't so much HTTP overhead but WebSocket overhead. WebSockets add a bit of message framing and whatnot that may be overhead if you don't need it.
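To put numbers on that framing overhead, here's a back-of-the-envelope sketch (my own illustration, not from the thread) of what RFC 6455 adds per message: a 2-byte base header, an extended length field for bigger payloads, and a 4-byte mask on client frames.

```python
def ws_frame_overhead(payload_len: int, from_client: bool = True) -> int:
    """Bytes of WebSocket framing added per message (RFC 6455)."""
    overhead = 2                      # FIN/opcode byte + mask/length byte
    if payload_len > 65535:
        overhead += 8                 # 64-bit extended payload length
    elif payload_len > 125:
        overhead += 2                 # 16-bit extended payload length
    if from_client:
        overhead += 4                 # 32-bit masking key (client frames only)
    return overhead

# e.g. a 1 KiB client message carries 8 bytes of framing (~0.8% overhead)
print(ws_frame_overhead(1024))  # 8
```

So per-message cost is 2-14 bytes; whether that's "overhead" depends entirely on whether you needed framing anyway.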
In 99% of applications, if you need encryption, authentication, and message framing, you would be hard-pressed to find a significantly more efficient option.
AFAIK, WebSockets don't do authentication. And the "encryption" they do is minimal: a client-side XOR mask with a 4-byte key sent in the clear in each frame. They do provide framing.
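To illustrate what that masking actually is (a minimal sketch of RFC 6455's client-frame masking; it's a defense against confused proxies and cache poisoning, not encryption):

```python
import os

def mask_payload(payload: bytes, key: bytes) -> bytes:
    """RFC 6455 client-frame masking: byte i is XORed with key[i % 4].

    The 4-byte key travels in the clear in the same frame, so this
    defeats intermediary confusion attacks, not eavesdroppers.
    """
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))

key = os.urandom(4)
masked = mask_payload(b"hello", key)
assert mask_payload(masked, key) == b"hello"  # XOR is its own inverse
```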
It's not super common, but if all your messages have a 16-bit length, you can just use TLS framing. I would argue that TLS framing is inefficient (multiple length terms), but using it by itself is better than adding a redundant framing layer.
But IMHO, there is significant benefit from removing a layer where it's unneeded.
WebSocket allows for custom headers and query parameters, which make it possible to run a basic authentication scheme during the handshake, and later on additional authorisation in the messages themselves if really necessary.
> And the "encryption" they do is minimal: a client-side XOR mask with a 4-byte key sent in the clear in each frame. They do provide framing.
WSS (WebSocket Secure) is the TLS-encrypted version of WebSocket (WS), similar to HTTP vs. HTTPS.
They do during the initial handshake (protocol upgrade from HTTP to WebSocket).
Afterwards the message body can be used to send authorisation data.
Server support will depend on tech but Node.js has great support.
No, I don't think you get it. `new WebSocket()` from JS takes no arguments for headers. You literally can't send custom headers during the handshake from JS. https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/W...
Actually, I will look into using the subprotocol as a way to do auth, but most impls in the wild send the auth as the first message.
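A minimal sketch of that first-message pattern, using the third-party Python `websockets` package (recent versions; older ones passed a `path` argument to the handler). The token value and close code here are made up for illustration:

```python
import asyncio
import websockets

EXPECTED_TOKEN = "hypothetical-shared-secret"  # illustration only

async def handler(ws):
    # First message is the credential; everything after is application data.
    token = await asyncio.wait_for(ws.recv(), timeout=5)
    if token != EXPECTED_TOKEN:
        await ws.close(code=4401, reason="unauthorized")  # 4xxx codes are app-defined
        return
    async for message in ws:
        await ws.send(message)  # echo, as a stand-in for real work

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```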
The fact the protocol in theory supports it doesn't really matter much since no browser implements that part of the spec.
WebSockets solve a bunch of those low level problems for you, in a well specified way with plenty of existing libraries.
(This is true regardless of UTF-8 -- in-band encodings are almost always brittle!)
WS(S) has in-the-box solutions for a lot of these... and on top of that: application gateways, distribution, failover, etc. You get a lot of already-solved problems in the box, so to speak. If you use raw sockets, now you have to implement all of these things yourself, and you aren't gaining much over just using WSS.
Since they're using Arrow they might look into Flight RPC [1] which is made for this use case.
The details of this are well covered in sibling comments, but at a higher level, two thoughts on this:
1. I see a lot of backlash lately against everything being HTTP-ified, with little justification other than a presumption that it necessarily adds overhead. Perf-wise, HTTP has come a long way & modern HTTP is a very efficient protocol. I think this has cleared the way for it to be a foundation for many more things than in the past. HTTP/3 being over UDP might clear the way for more of this (albeit I think the overhead of TCP/IP is also often overstated - see e.g. MQTT).
2. Overhead can be defined in two ways: perf. & maintenance complexity. Modern HTTP does add a bit of the latter, so in that context it may be a fair concern, but I think the large range of competing implementations probably obviates any concern here & the alternative usually involves doing something custom (albeit simpler), so you run into inconsistency, re-invented wheels & bus factor issues there.
High-level spec changes have been infrequent, with long dual-support periods, & have generally seen pretty slow, gradual client & server adoption. 1.1 was 1997 & continues to have widespread support today. 2 & 3 were proposed in 2015 & 2016 - almost 2 decades later - & 2 is only really starting to see wide support today, with 3 still broadly unsupported.
I'm likely missing a lot of nuance in between versioned releases though - I know e.g. 2 saw at least two major additions/updates, though I thought those were mostly additive security features rather than changes to existing protocol features.
There are two ways I've noticed to design an application.
Some people grab some tools out of their toolbox that look like they fit - I need a client/server, I know web clients/servers, so I'll use a web client/server.
Other people think about what the computer actually has to do and then write code to achieve that: Computer A has to send a block of data to computer B, and this has to work on Linux (which means no bit-banging - you can only go as low as raw sockets). This type of person may still take shortcuts, but by intention, not because it's the only thing they know: if HTTP is only one function call in Python, it makes sense to use HTTP - not because it's the only thing you know, but because it's good enough, you know it works well enough for this problem, and you can change it later if it becomes a bottleneck.
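(That "one function call" is literal, stdlib only; the URL is a placeholder:

```python
import urllib.request

# "HTTP is one function call": fetch a block of data from computer B.
data = urllib.request.urlopen("http://computer-b.example/data").read()
```

That convenience is a perfectly good reason to pick HTTP, as long as it's a choice rather than a default.)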
Websockets are an odd choice because they're sort of the worst of both worlds: they're barely more convenient than raw sockets (there's framing, but framing is easy), but they also add a bunch of performance and complexity overhead over raw sockets, and more things that can go wrong. So they don't seem to win on the convenience/laziness front nor on the performance/security/robustness front. If your client had to be a web browser, or could sometimes be a web browser, or if you wanted to pass the connections through an HTTP reverse proxy, those would be good reasons to choose websockets, but none of them are the case here.
> they're barely more convenient than raw sockets
Honestly, raw sockets are pretty convenient - I'm not convinced Websockets are more convenient at all (assuming you already know both & there are no learning curves). Raw sockets might even be more convenient.
I think it's features rather than convenience that is more likely to drive Websocket usage when comparing the two.
> they also add a bunch of performance and complexity overhead over raw sockets
This is the part that I was getting at in my above comment. I agree in theory, but I just think the "a bunch" quantifier is a bit of an exaggeration. They really add very, very little performance overhead in practice: a negligible amount in most cases.
So for a likely-negligible performance loss, & a likely-negligible convenience difference, you're getting a protocol with built-in encryption, widespread documentation & community support - especially important if you're writing code that other people will need to take over & maintain - & as you alluded to: extensibility (you may never need browser support or http proxying, but having the option is compelling when the trade-offs are so negligible).
I think it's significantly more convenient if your stack touches multiple programming languages. Otherwise you'd have to implement framing yourself in all of them. Not hard, but I don't see the benefit either.
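For comparison, the hand-rolled alternative usually looks something like this (a sketch of 4-byte length-prefix framing over a plain TCP socket):

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    # recv() may return short reads; loop until n bytes arrive.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)
```

Easy enough, but every language in your stack then needs its own copy of this, plus TLS and auth on top, which is the point above.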
> they also add a bunch of performance and complexity overhead over raw sockets
What performance overhead is there over raw sockets once you're past the protocol upgrade? It seems negligible if your connection is even slightly long-lived.
- you can serve any number of disjoint websocket services via the same port via HTTP routing - this also means you can do TLS termination in one place, so the downstream websocket service doesn't have to deal with the nitty-gritty of certificates.
Sure, it adds a hop compared to socket passing, and there are ways to get similar fanout over TCP with a custom protocol. But you'd need to add that to every stack the interacting components use, while websocket libraries exist for most languages that are likely to be used in such an endeavor.
How was this node setup chosen? Specifically, 3.8 vCPU and 30 GiB RAM per worker? Why not just run 16 workers total, each using the entire 64 vCPUs and 504 GiB of memory?
I was out of quota in Azure - so I had to fit in the 63 nodes... :)
There are so many variables to try - it is a little overwhelming...
Prof. Hannes Mühleisen, cofounder of DuckDB:
[DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen](https://www.youtube.com/watch?v=YQEUkFWa69o) (53 min) Talk from Systems Distributed '25: https://systemsdistributed.com
How long did it take to distribute and import the data to all workers, what is the total time from file to result?
I can do this a million times faster on one machine; it just depends on what work I do up front.
The whole reason you have to account for the time you spend setting it up is so that all the work spent processing the data is timed. Otherwise we could just precompute the answer and print it on demand; that is very fast and easy.
Just getting it into memory is a large bottleneck in the actual challenge.
If I first put it into a DB with statistics that tracks the needed min/max/mean then it's basically instant to retrieve, but also slower to set up because that work needs to be done somewhere. That's why the challenge is time from file to result.
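For concreteness, the single-machine "file to result" measurement is timing something like this (a sketch; the path and column names are placeholders for the 1TRC data):

```python
import time
import duckdb

start = time.time()
result = duckdb.sql("""
    SELECT station,
           min(measure) AS min_t,
           max(measure) AS max_t,
           avg(measure) AS mean_t
    FROM read_parquet('measurements/*.parquet')  -- placeholder path
    GROUP BY station
    ORDER BY station
""").fetchall()
print(f"{len(result)} stations in {time.time() - start:.1f}s")
```

Pre-building a table of per-station min/max/sum/count would make the query near-instant, but the build time would then be the thing you're hiding.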
Maybe I'm missing something fundamental here
600k rows is likely less than 1 GB of data, and should take about a second to load into RAM on modern NVMe SSD RAIDs.
I plan on getting GizmoEdge to production-grade quality eventually so folks can use it as a service or licensed software. There is a lot of work to do, though :)
> Workers download, decompress, and materialize their shards into DuckDB databases built from Parquet files.
I'm interested to know whether the 5s query time includes this materialization step of downloading the files etc, or is this result from workers that have been "pre-warmed". Also is the data in DuckDB in memory or on disk?
You can have GizmoEdge reference cloud (remote) data as well, but of course that would be slower than what I did for the challenge here...
The data is on disk - on locally mounted NVMe on each worker - in the form of a DuckDB database file (once the worker has converted it from parquet). I originally kept the data in parquet, but the duckdb format was about 10 to 15% faster - and since I was trying to squeeze every drop of performance - I went ahead and did that...
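In DuckDB terms, that conversion step is roughly this (paths are placeholders, not the actual GizmoEdge code):

```python
import duckdb

# One-time materialization: convert the worker's parquet shard into
# DuckDB's native file format, which queried ~10-15% faster here.
con = duckdb.connect("/mnt/nvme/shard.duckdb")  # placeholder path
con.sql("""
    CREATE TABLE measurements AS
    SELECT * FROM read_parquet('/mnt/nvme/shard-*.parquet')
""")
con.close()
```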
Thanks for the questions.
GizmoEdge is not production yet - this was just to demonstrate the art of the possible. I wanted to divide-and-conquer a huge dataset with a lot of power...
DuckDB blog: https://duckdb.org/2025/10/09/benchmark-results-14-lts
> Our cluster ran on Azure Standard E64pds v6 nodes, each providing 64 vCPUs and 504 GiB of RAM.
Yes, I would _expect_ when each node has that kind of power it should return very impressive speeds.
For background, here is the initial ideation of the "One Trillion Row Challenge" this submission originally aimed to participate in: https://docs.coiled.io/blog/1trc.html
The idea of a GPU database has been reasonably well explored; they are extremely fast, but have been cost-ineffective due to GPU prices. When the dataset is larger than GPU memory, you also incur slowdowns due to cycling data between CPU and GPU memory.
In theory a Zen 5 / Epyc Turin can have up to 4 TB of RAM. So how would a more traditional non-clustered DB stand up?
With 1000 k8s pods, each with 30 GB of RAM, there has to be a bit of overhead/wastage going on.
BigQuery is comparable to DuckDB. I’m curious how the various Redshift flavors (provisioned, serverless, spectrum) and Spark compare.
I don’t have a lot of experience with DuckDB but it seems like Spark is the most comparable.
It is true you would need to run the instance(s) 24/7 to get that performance all day; the startup time of a couple minutes is not ideal. We have a lot of work to do on the engine, but it has been a fun learning experience...
It’s important to know what you are benchmarking before you start and to control for extrinsic factors as explicitly as possible.
Are you looking at distributed queries directly over S3? We did this in ClickHouse and can do instant virtual sharding over large data sets in S3. We call it parallel replicas: https://clickhouse.com/blog/clickhouse-parallel-replicas
You can find those disks on Hetzner. Not AWS, though.
Not to mention that Azure now exposes local drives as raw NVMe devices mapped straight through to the guest with no virtualisation overheads.
This pool lacks many of the features of a distributed cluster such as recovery, quorum, and storage state management, and queries run through a single server. What happens when a node goes down? Does it give up, replan, or just hang? How does it divide up resources between multiple requests? Can it distribute joins and other intermediate operators?
I have a soft spot in my heart for duckdb, but its uniqueness is in avoiding the large-scale clustering that other engines already do reasonably well.
We also have bitmap aggregation capabilities for exact count distinct - something I worked with Oracle, Snowflake, Databricks, and DuckDB Labs on implementing. It isn't as fast as HLL, but it is 100% accurate...
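As a toy illustration of the general idea (my sketch, not the actual implementation): each worker sets one bit per key it sees, the bitmaps OR together losslessly across workers, and the popcount of the union is the exact distinct count.

```python
def shard_bitmap(values: list[int]) -> int:
    """Build a bitmap (Python int as bitset) of the keys seen on one shard."""
    bm = 0
    for v in values:
        bm |= 1 << v
    return bm

# Two workers see overlapping IDs...
bm_a = shard_bitmap([1, 5, 9, 42])
bm_b = shard_bitmap([5, 9, 100])

# ...the union merges exactly, unlike HLL's probabilistic sketch.
union = bm_a | bm_b
print(union.bit_count())  # 5 exact distinct values (Python 3.10+)
```

Real engines bucket the key space and use compressed (e.g. roaring) bitmaps so the per-worker partials stay small and cheap to merge.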
How would you compare this solution to BigQuery?
> - A worker SQL to execute on each distributed node
> - A combinatorial SQL to run server-side for final aggregation
A specific instance of MapReduce (using SQL!): https://en.wikipedia.org/wiki/MapReduce
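The classic trick is that the worker query produces algebraically mergeable partials (sum and count rather than avg), so the server-side query can finish the job. A self-contained sketch with the DuckDB Python API, toy data and placeholder names:

```python
import duckdb

duckdb.sql("CREATE TABLE local_shard (station TEXT, measure DOUBLE)")
duckdb.sql("INSERT INTO local_shard VALUES ('a', 1.0), ('a', 3.0), ('b', 2.0)")

# "Map": each worker runs this over its own shard. Ship sum/count rather
# than avg, because averages of averages don't combine correctly.
partials = duckdb.sql("""
    SELECT station, min(measure) AS min_t, max(measure) AS max_t,
           sum(measure) AS sum_t, count(*) AS cnt
    FROM local_shard GROUP BY station
""")

# "Reduce": the server unions all workers' partials and finishes the math.
duckdb.sql("""
    SELECT station, min(min_t) AS min_t, max(max_t) AS max_t,
           sum(sum_t) / sum(cnt) AS mean_t
    FROM partials GROUP BY station
""").show()
```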