Would be interesting to see if it would be possible to handle such data on one node, since the servers they are using are quite beefy.
The data is in a publicly accessible bucket, but the requester is responsible for any egress fees...
Do you know if anyone attempted to run this on the least amount of hardware possible with reasonable processing times?
Can someone explain why you would use websockets in an application where neither end is a browser? Why not just use regular sockets and cut the overhead of the http layer? Is there a real benefit I’m missing?
https://en.wikipedia.org/wiki/Constrained_Application_Protoc...
It is UDP-based but adds handshakes and retransmissions. But I am guessing for your benchmark transmission overhead isn't a major concern.
Websockets are not that bad; only the initial connection is HTTP. As long as you don't create a ton of connections all the time, it shouldn't be much slower than a plain TCP socket (a purely theoretical assumption on my part; I never tested it).
There's a whole skit in the vein of "What have the Romans ever done for us?" about ZeroMQ[1] which has probably been lost to the search index by now.
As someone who has held a socket wrench before and fought tcp_cork and DSACK, I'd say WebSockets aren't a bad abstraction to sit on top of, especially if you're intending to throw TLS in there anyway.
Low-level sockets are like assembly: you can use them, but they're a whole box of complexity (though you might go completely raw sometimes, like the tickle-ACK in the ctdb[2] implementation).
There isn't much overhead here other than connection setup. For HTTP/1 the connection is just "upgraded" to WebSockets. For HTTP/2 I think the HTTP layer still lives on a bit so that you can use connection multiplexing (which may be overhead if you have no use for it here), but that is still a very thin layer.
So I think the question isn't so much HTTP overhead but WebSocket overhead. WebSockets add a bit of message framing and whatnot that may be overhead if you don't need it.
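To put numbers on that framing overhead, here's a back-of-the-envelope sketch (my own illustration, not from the thread) of what RFC 6455 adds per message: a 2-byte base header, an extended length field for bigger payloads, and a 4-byte mask on client frames.

```python
def ws_frame_overhead(payload_len: int, from_client: bool = True) -> int:
    """Bytes of WebSocket framing added per message (RFC 6455)."""
    overhead = 2                      # FIN/opcode byte + mask/length byte
    if payload_len > 65535:
        overhead += 8                 # 64-bit extended payload length
    elif payload_len > 125:
        overhead += 2                 # 16-bit extended payload length
    if from_client:
        overhead += 4                 # 32-bit masking key (client frames only)
    return overhead

# e.g. a 1 KiB client message carries 8 bytes of framing (~0.8% overhead)
print(ws_frame_overhead(1024))  # 8
```

So per-message cost is 2-14 bytes; whether that's "overhead" depends entirely on whether you needed framing anyway.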
In 99% of applications, if you need encryption, authentication, and message framing, you would be hard-pressed to find a significantly more efficient option.
AFAIK, WebSockets don't do authentication. And the "encryption" they do is minimal: a client-side XOR mask with a 4-byte key sent in the clear in each frame. They do provide framing.
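To illustrate what that masking actually is (a minimal sketch of RFC 6455's client-frame masking; it's a defense against confused proxies and cache poisoning, not encryption):

```python
import os

def mask_payload(payload: bytes, key: bytes) -> bytes:
    """RFC 6455 client-frame masking: byte i is XORed with key[i % 4].

    The 4-byte key travels in the clear in the same frame, so this
    defeats intermediary confusion attacks, not eavesdroppers.
    """
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))

key = os.urandom(4)
masked = mask_payload(b"hello", key)
assert mask_payload(masked, key) == b"hello"  # XOR is its own inverse
```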
It's not super common, but if all your messages have a 16-bit length, you can just use TLS framing. I would argue that TLS framing is inefficient (multiple length terms), but using it by itself is better than adding a redundant framing layer.
But IMHO, there is significant benefit from removing a layer where it's unneeded.
WebSocket allows for custom headers and query parameters, which make it possible to run a basic authentication scheme during the handshake, and later on additional authorisation in the messages themselves if really necessary.
> And the "encryption" they do is minimal: a client-side XOR mask with a 4-byte key sent in the clear in each frame. They do provide framing.
WSS (WebSocket Secure) is the TLS-encrypted version of WebSocket (WS), similar to HTTP vs. HTTPS.
They do during the initial handshake (protocol upgrade from HTTP to WebSocket).
Afterwards the message body can be used to send authorisation data.
Server support will depend on tech but Node.js has great support.
No, I don't think you get it. `new WebSocket()` from JS takes no arguments for headers. You literally can't send custom headers during the handshake from JS. https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/W...
Actually, I will look into using the subprotocol as a way to do auth, but most impls in the wild send the auth as the first message.
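A minimal sketch of that first-message pattern, using the third-party Python `websockets` package (recent versions; older ones passed a `path` argument to the handler). The token value and close code here are made up for illustration:

```python
import asyncio
import websockets

EXPECTED_TOKEN = "hypothetical-shared-secret"  # illustration only

async def handler(ws):
    # First message is the credential; everything after is application data.
    token = await asyncio.wait_for(ws.recv(), timeout=5)
    if token != EXPECTED_TOKEN:
        await ws.close(code=4401, reason="unauthorized")  # 4xxx codes are app-defined
        return
    async for message in ws:
        await ws.send(message)  # echo, as a stand-in for real work

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```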
The fact the protocol in theory supports it doesn't really matter much since no browser implements that part of the spec.
WebSockets solve a bunch of those low level problems for you, in a well specified way with plenty of existing libraries.
(This is true regardless of UTF-8 -- in-band encodings are almost always brittle!)
WS(S) has in-the-box solutions for a lot of these... and on top of that: application gateways, distribution, failover, etc. You get a lot of already-solved problems in the box, so to speak. If you use raw sockets, now you have to implement all of these things yourself, and you aren't gaining much over just using WSS.
Since they're using Arrow they might look into Flight RPC [1] which is made for this use case.
The details of this are well covered in sibling comments, but at a higher level, two thoughts on this:
1. I see a lot of backlash lately against everything being HTTP-ified, with little justification other than a presumption that it necessarily adds overhead. Perf-wise, HTTP has come a long way & modern HTTP is a very efficient protocol. I think this has cleared the way for it to be a foundation for many more things than in the past. HTTP/3 being over UDP might clear the way for more of this (albeit I think the overhead of TCP/IP is also often overstated - see e.g. MQTT).
2. Overhead can be defined in two ways: perf. & maintenance complexity. Modern HTTP does add a bit of the latter, so in that context it may be a fair concern, but I think the large range of competing implementations probably obviates any concern here & the alternative usually involves doing something custom (albeit simpler), so you run into inconsistency, re-invented wheels & bus factor issues there.
High-level spec changes have been infrequent, with long dual-support periods, & have generally seen pretty slow, gradual client & server adoption. 1.1 was 1997 & continues to have widespread support today. 2 & 3 were proposed in 2015 & 2016 - almost 2 decades later - & 2 is only really starting to see wide support today, with 3 still broadly unsupported.
I'm likely missing a lot of nuance in between versioned releases though - I know e.g. 2 saw at least two major additions/updates, though I thought those were mostly additive security features rather than changes to existing protocol features.
There are two ways I've noticed to design an application.
Some people grab some tools out of their toolbox that look like they fit - I need a client/server, I know web clients/servers, so I'll use a web client/server.
Other people think about what the computer actually has to do and then write code to achieve that: Computer A has to send a block of data to computer B, and this has to work on Linux (which means no bit-banging - you can only go as low as raw sockets). This type of person may still take shortcuts, but by intention, not because it's the only thing they know: if HTTP is only one function call in Python, it makes sense to use HTTP - not because it's the only thing you know, but because it's good enough, you know it works well enough for this problem, and you can change it later if it becomes a bottleneck.
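(That "one function call" is literal, stdlib only; the URL is a placeholder:

```python
import urllib.request

# "HTTP is one function call": fetch a block of data from computer B.
data = urllib.request.urlopen("http://computer-b.example/data").read()
```

That convenience is a perfectly good reason to pick HTTP, as long as it's a choice rather than a default.)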
Websockets are an odd choice because they're sort of the worst of both worlds: they're barely more convenient than raw sockets (there's framing, but framing is easy), but they also add a bunch of performance and complexity overhead over raw sockets, and more things that can go wrong. So they don't seem to win on the convenience/laziness front nor on the performance/security/robustness front. If your client had to be a web browser, or could sometimes be a web browser, or if you wanted to pass the connections through an HTTP reverse proxy, those would be good reasons to choose websockets, but none of them are the case here.
> they're barely more convenient than raw sockets
Honestly, raw sockets are pretty convenient - I'm not convinced Websockets are more convenient at all (assuming you already know both & there are no learning curves). Raw sockets might even be more convenient.
I think it's features rather than convenience that is more likely to drive Websocket usage when comparing the two.
> they also add a bunch of performance and complexity overhead over raw sockets
This is the part that I was getting at in my above comment. I agree in theory, but I just think the "a bunch" quantifier is a bit of an exaggeration. They really add very, very little performance overhead in practice: a negligible amount in most cases.
So for a likely-negligible performance loss, & a likely-negligible convenience difference, you're getting a protocol with built-in encryption, widespread documentation & community support - especially important if you're writing code that other people will need to take over & maintain - & as you alluded to: extensibility (you may never need browser support or http proxying, but having the option is compelling when the trade-offs are so negligible).
I think it's significantly more convenient if your stack touches multiple programming languages. Otherwise you'd have to implement framing yourself in all of them. Not hard, but I don't see the benefit either.
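For comparison, the hand-rolled alternative usually looks something like this (a sketch of 4-byte length-prefix framing over a plain TCP socket):

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    # recv() may return short reads; loop until n bytes arrive.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)
```

Easy enough, but every language in your stack then needs its own copy of this, plus TLS and auth on top, which is the point above.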
> they also add a bunch of performance and complexity overhead over raw sockets
What performance overhead is there over raw sockets once you're past the protocol upgrade? It seems negligible if your connection is even slightly long-lived.
- you can serve any number of disjoint websocket services via the same port via HTTP routing - this also means you can do TLS termination in one place, so the downstream websocket service doesn't have to deal with the nitty-gritty of certificates.
Sure, it adds a hop compared to socket passing, and there are ways to get similar fanout over TCP with a custom protocol. But you'd need to add that to every stack the interacting components use, while websocket libraries exist for most languages that are likely to be used in such an endeavor.
How was this node setup chosen? Specifically, 3.8 vCPU and 30 GiB RAM per worker? Why not just run 16 workers total, each using the entire 64 vCPUs and 504 GiB of memory?
I was out of quota in Azure - so I had to fit in the 63 nodes... :)
There are so many variables to try - it is a little overwhelming...
Prof. Hannes Mühleisen, cofounder of DuckDB:
[DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen](https://www.youtube.com/watch?v=YQEUkFWa69o) (53 min) Talk from Systems Distributed '25: https://systemsdistributed.com
How long did it take to distribute and import the data to all workers, what is the total time from file to result?
I can do this a million times faster on one machine; it just depends on what work I do up front.
The whole reason you have to account for the time you spend setting it up is so that all the work spent processing the data is timed. Otherwise we could just precompute the answer and print it on demand; that is very fast and easy.
Just getting it into memory is a large bottleneck in the actual challenge.
If I first put it into a DB with statistics that tracks the needed min/max/mean then it's basically instant to retrieve, but also slower to set up because that work needs to be done somewhere. That's why the challenge is time from file to result.
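For concreteness, the single-machine "file to result" measurement is timing something like this (a sketch; the path and column names are placeholders for the 1TRC data):

```python
import time
import duckdb

start = time.time()
result = duckdb.sql("""
    SELECT station,
           min(measure) AS min_t,
           max(measure) AS max_t,
           avg(measure) AS mean_t
    FROM read_parquet('measurements/*.parquet')  -- placeholder path
    GROUP BY station
    ORDER BY station
""").fetchall()
print(f"{len(result)} stations in {time.time() - start:.1f}s")
```

Pre-building a table of per-station min/max/sum/count would make the query near-instant, but the build time would then be the thing you're hiding.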
Maybe I'm missing something fundamental here
600k rows is likely less than 1 GB of data, and should take about a second to load into RAM on modern NVMe SSD RAIDs.
I plan on getting GizmoEdge to production-grade quality eventually so folks can use it as a service or licensed software. There is a lot of work to do, though :)
> Workers download, decompress, and materialize their shards into DuckDB databases built from Parquet files.
I'm interested to know whether the 5s query time includes this materialization step of downloading the files etc, or is this result from workers that have been "pre-warmed". Also is the data in DuckDB in memory or on disk?
You can have GizmoEdge reference cloud (remote) data as well, but of course that would be slower than what I did for the challenge here...
The data is on disk - on locally mounted NVMe on each worker - in the form of a DuckDB database file (once the worker has converted it from parquet). I originally kept the data in parquet, but the duckdb format was about 10 to 15% faster - and since I was trying to squeeze every drop of performance - I went ahead and did that...
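In DuckDB terms, that conversion step is roughly this (paths are placeholders, not the actual GizmoEdge code):

```python
import duckdb

# One-time materialization: convert the worker's parquet shard into
# DuckDB's native file format, which queried ~10-15% faster here.
con = duckdb.connect("/mnt/nvme/shard.duckdb")  # placeholder path
con.sql("""
    CREATE TABLE measurements AS
    SELECT * FROM read_parquet('/mnt/nvme/shard-*.parquet')
""")
con.close()
```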
Thanks for the questions.
GizmoEdge is not production yet - this was just to demonstrate the art of the possible. I wanted to divide-and-conquer a huge dataset with a lot of power...
DuckDB blog: https://duckdb.org/2025/10/09/benchmark-results-14-lts
> Our cluster ran on Azure Standard E64pds v6 nodes, each providing 64 vCPUs and 504 GiB of RAM.
Yes, I would _expect_ when each node has that kind of power it should return very impressive speeds.
For background, here is the initial ideation of the "One Trillion Row Challenge" this submission originally aimed to participate in: https://docs.coiled.io/blog/1trc.html
The idea of a GPU database has been reasonably well explored; they are extremely fast, but have been cost-ineffective due to GPU prices. When the dataset is larger than GPU memory, you also incur slowdowns due to cycling data between CPU and GPU memory.
In theory a Zen 5 / Epyc Turin can have up to 4 TB of RAM. So how would a more traditional non-clustered DB stand up?
With 1000 k8s pods, each with 30 GB of RAM, there has to be a bit of overhead/wastage going on.
BigQuery is comparable to DuckDB. I’m curious how the various Redshift flavors (provisioned, serverless, spectrum) and Spark compare.
I don’t have a lot of experience with DuckDB but it seems like Spark is the most comparable.
It is true you would need to run the instance(s) 24/7 to get that performance all day; the startup time of a couple minutes is not ideal. We have a lot of work to do on the engine, but it has been a fun learning experience...
It’s important to know what you are benchmarking before you start and to control for extrinsic factors as explicitly as possible.
Are you looking at distributed queries directly over S3? We did this in ClickHouse and can do instant virtual sharding over large data sets in S3. We call it parallel replicas: https://clickhouse.com/blog/clickhouse-parallel-replicas
You can find those disks on Hetzner. Not AWS, though.
Not to mention that Azure now exposes local drives as raw NVMe devices mapped straight through to the guest with no virtualisation overheads.
This pool lacks many of the features of a distributed cluster such as recovery, quorum, and storage state management, and queries run through a single server. What happens when a node goes down? Does it give up, replan, or just hang? How does it divide up resources between multiple requests? Can it distribute joins and other intermediate operators?
I have a soft spot in my heart for duckdb, but its uniqueness is in avoiding the large-scale clustering that other engines already do reasonably well.
We also have bitmap aggregation capabilities for exact count distinct - something I worked with Oracle, Snowflake, Databricks, and DuckDB Labs on implementing. It isn't as fast as HLL, but it is 100% accurate...
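As a toy illustration of the general idea (my sketch, not the actual implementation): each worker sets one bit per key it sees, the bitmaps OR together losslessly across workers, and the popcount of the union is the exact distinct count.

```python
def shard_bitmap(values: list[int]) -> int:
    """Build a bitmap (Python int as bitset) of the keys seen on one shard."""
    bm = 0
    for v in values:
        bm |= 1 << v
    return bm

# Two workers see overlapping IDs...
bm_a = shard_bitmap([1, 5, 9, 42])
bm_b = shard_bitmap([5, 9, 100])

# ...the union merges exactly, unlike HLL's probabilistic sketch.
union = bm_a | bm_b
print(union.bit_count())  # 5 exact distinct values (Python 3.10+)
```

Real engines bucket the key space and use compressed (e.g. roaring) bitmaps so the per-worker partials stay small and cheap to merge.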
How would you compare this solution to BigQuery?
> - A worker SQL to execute on each distributed node
> - A combinatorial SQL to run server-side for final aggregation
A specific instance of MapReduce (using SQL!): https://en.wikipedia.org/wiki/MapReduce
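The classic trick is that the worker query produces algebraically mergeable partials (sum and count rather than avg), so the server-side query can finish the job. A self-contained sketch with the DuckDB Python API, toy data and placeholder names:

```python
import duckdb

duckdb.sql("CREATE TABLE local_shard (station TEXT, measure DOUBLE)")
duckdb.sql("INSERT INTO local_shard VALUES ('a', 1.0), ('a', 3.0), ('b', 2.0)")

# "Map": each worker runs this over its own shard. Ship sum/count rather
# than avg, because averages of averages don't combine correctly.
partials = duckdb.sql("""
    SELECT station, min(measure) AS min_t, max(measure) AS max_t,
           sum(measure) AS sum_t, count(*) AS cnt
    FROM local_shard GROUP BY station
""")

# "Reduce": the server unions all workers' partials and finishes the math.
duckdb.sql("""
    SELECT station, min(min_t) AS min_t, max(max_t) AS max_t,
           sum(sum_t) / sum(cnt) AS mean_t
    FROM partials GROUP BY station
""").show()
```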