I like this one, where they put a dataset on 80 machines, only for someone to load the same dataset on a single Intel NUC and beat them on query time.
https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...
Datasets never become big enough…
Materialised views add insert load for every view you define. So if you ever need to slice your data in a way that wasn't predicted, or in a way that would have pushed ingress load beyond what you had to spare (say, finding all devices with a specific model and year+month because there's a dodgy lot), you'll really wish you were on a DB that can actually run that query, instead of one that can only return your _precalculated_ results.
Not only is this a contrived non-comparison, but the claim is readily disproven by the limitations that basically _everyone_ running single-instance ClickHouse hits once they actually have a large dataset.
Spark and Hadoop have their place, maybe not in rinky dink startup land, but definitely in the world of petabyte and exabyte data processing.
Just because you don't have experience of these situations, it doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."
(2018, 222 comments) https://news.ycombinator.com/item?id=17135841
(2022, 166 comments) https://news.ycombinator.com/item?id=30595026
(2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.
By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).
Just how much data do you need when these sort of clustered approaches really start to make sense?
Hah, incredibly funny. I remember doing the complete opposite about 15 years ago: some beginner developer had set up a whole interconnected system with multiple processes and whatnot to process a bunch of JSON, and it took forever. It got replaced with a bash script plus Python!
> Just how much data do you need when these sort of clustered approaches really start to make sense?
I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.
You can buffer data, yield items as they become available before discarding them, use the visitor pattern, and so on.
One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream
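For example, a minimal sketch (assuming the json-stream API is roughly as its README describes; the file name, "results" key, and field names here are made up):

    import json_stream

    # Stream a large JSON file without loading it all into memory.
    # In the default "transient" mode you can only move forward through
    # the document, and keys must be read in document order; that
    # restriction is exactly what keeps memory usage bounded.
    with open("huge.json") as f:
        data = json_stream.load(f)
        for record in data["results"]:       # hypothetical top-level array
            if record["model"] == "X200":    # filter as values stream past
                print(record["device_id"])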
Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser eagerly spits out each piece of data that matches the query as soon as it finds it.
The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.
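Not the parent's actual code, but a minimal hand-rolled sketch of the idea, assuming the input is one big JSON array of objects:

    import json

    def stream_objects(chunks):
        """Yield each top-level object from a JSON array delivered in chunks.

        Only the object currently being assembled is kept in memory; the
        state machine tracks string/escape state and brace depth so it
        knows where one object ends and the next begins.
        """
        depth = 0
        in_string = False
        escaped = False
        buf = []
        for chunk in chunks:
            for ch in chunk:
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch == "{":
                    depth += 1
                    buf.append(ch)
                elif ch == "}":
                    depth -= 1
                    buf.append(ch)
                    if depth == 0:
                        yield json.loads("".join(buf))
                        buf = []
                elif depth > 0:
                    buf.append(ch)
                # array brackets, commas and whitespace between objects are skipped

    def chunks_from(path, size=1 << 20):
        # Read the file in fixed-size chunks so memory stays bounded.
        with open(path, encoding="utf-8") as f:
            while piece := f.read(size):
                yield piece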
I can, however, say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key findings was that fewer machines with vertically scaled hardware almost always outperformed any sort of distributed setup (albeit not always from a price/performance perspective).
The real value often comes from the ability to do retries, to soak up leftover underutilized hardware (e.g. spot instances, or your own data center at times when load is lower), and to handle hardware failures, etc., all while the full suite of tools above keeps working.
I did not see your comment earlier, but to stay with chess, see https://news.ycombinator.com/item?id=46667287, with ~14 TB uncompressed.
It's not humongous and it can certainly fit on disk(s), but not on a typical laptop.
We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
These hardly exist in practice.
But I get what you mean.
mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.
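For reference, a toy mrjob job looks roughly like this (the classic word-count example from memory; the file and class names are my own):

    # word_count.py
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Emit a (word, 1) pair for every word in the input line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the per-word counts emitted by the mappers.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

The same script runs locally with `python word_count.py -r local input.txt` and on a cluster with `-r hadoop` or `-r emr`, which is what made the local mode such a convenient development loop.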
Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.
I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
It would be interesting to redo the benchmark but with a (much) larger database.
Nowadays the biggest open chess dataset must come from Lichess (https://database.lichess.org), with ~7B games: 2.34 TB compressed, ~14 TB uncompressed.
Would Hadoop win here?