
Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
35•tosh•3h ago

Comments

killingtime74•1h ago
Hadoop, blast from the past
ejoebstl•1h ago
Great article. Hadoop (and other similar tools) are for datasets so huge they don't fit on one machine.
vjerancrnjak•1h ago
https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one...

I like this one, where they put a dataset on 80 machines, only for someone to then put the same dataset on a single Intel NUC and outperform it in query time.

https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...

Datasets never become big enough…

literalAardvark•1h ago
Well yeah, but that's a _very_ different engineering decision with different constraints; it's not fully apples to apples.

Having materialised views increases insert load for every view. So if you want to slice your data in a way that wasn't predicted, or that would have increased ingress load beyond what you've got to spare (say, find all devices with a specific model and year+month because there's a dodgy lot), you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.

DetroitThrow•51m ago
>Datasets never become big enough…

Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations that basically _everyone_ using single-instance ClickHouse runs into if they actually have a large dataset.

Spark and Hadoop have their place, maybe not in rinky-dink startup land, but definitely in the world of petabyte- and exabyte-scale data processing.

saberience•24m ago
Well, at my old company we had some datasets in the 6-8 PB range, so tell me how we would run analytics on that dataset on an Intel NUC.

Just because you don't have experience of these situations doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."

PunchyHamster•1h ago
And we can have pretty fucking big single machines right now
rcarmo•1h ago
This has been a recurring theme for ages, with a few companies taking it to extremes—there are people transpiling COBOL to bash too…
paranoidrobot•1h ago
A selection of times it's been previously posted:

(2018, 222 comments) https://news.ycombinator.com/item?id=17135841

(2022, 166 comments) https://news.ycombinator.com/item?id=30595026

(2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.

nasretdinov•1h ago
And now, with things like DuckDB and clickhouse-local, you won't have to worry about data processing performance ever again. Just kidding, but with ClickHouse especially, handling large data volumes is so much better than it used to be, and even a single beefy server is often enough to satisfy all the data analytics needs of a moderate-to-large company.
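
For a sense of how little ceremony that takes, here's a minimal sketch using DuckDB's Python API (the games.csv file and its result column are invented for the example):

    import duckdb  # pip install duckdb

    # Aggregate a local CSV in place; no cluster, no ingestion step.
    # 'games.csv' and its 'result' column are hypothetical.
    rel = duckdb.sql("""
        SELECT result, count(*) AS games
        FROM 'games.csv'
        GROUP BY result
        ORDER BY games DESC
    """)
    print(rel.fetchall())

clickhouse-local is the same idea: point it at a file, write SQL, and get an answer without standing up a server.
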
torginus•1h ago
When I worked as a data engineer, I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON at tens of MB/s, creating a huge bottleneck.

By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).

Just how much data do you need before these sorts of clustered approaches really start to make sense?

embedding-shape•51m ago
> I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON

Hah, incredibly funny. I remember doing the complete opposite about 15 years ago: some beginner developer had set up a whole interconnected system with multiple processes and whatnot in order to process a bunch of JSON, and it took forever. It got replaced with a bash script + Python!

> Just how much data do you need before these sorts of clustered approaches really start to make sense?

I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.

commandersaki•48m ago
How do you stream-parse JSON? I thought you needed to ingest it whole to ensure it is syntactically valid, and most parsers don't work with incomplete or invalid JSON? Or at least it doesn't seem trivial.
rented_mule•29m ago
I don't know what the GP was referring to, but often this is about "JSONL" / "JSON Lines" - files containing one JSON object per line. This is common for things like log files. So, process the data as each line is deserialized rather than deserializing the entire file first.
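
In Python that pattern is about as simple as it gets (the file name is made up):

    import json

    # JSON Lines: one complete JSON object per line, so each line can be
    # parsed and handled on its own, in constant memory.
    with open("events.jsonl") as f:
        for line in f:
            record = json.loads(line)
            # ... process one record, then move on ...
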
shakna•27m ago
There's a whole heap of approaches, each with their own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid JSON.

You can buffer data, or yield values as they become available before discarding them, or use the visitor pattern, among others.

One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream
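
From memory, its basic usage looks roughly like this (treat the exact calls as an assumption rather than documentation):

    import json_stream  # pip install json-stream

    # Walk a large JSON document lazily instead of loading it all at once.
    # The file name and the 'results'/'score' keys are made-up examples.
    with open("huge.json") as f:
        data = json_stream.load(f)      # lazy, transient view of the document
        for item in data["results"]:
            print(item["score"])        # values are parsed as they're reached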

giovannibonetti•26m ago
You assume it is valid until it isn't, and you can have different strategies to handle that, like just skipping the broken part and carrying on.

Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser eagerly spits out a stream of data matching the query as soon as it finds it.

The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.
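
Not a hand-rolled state machine, but here's a rough sketch of the chunked idea using only the stdlib: buffer incoming text and emit each complete top-level object as soon as it can be decoded (this assumes a stream of concatenated or whitespace-separated JSON objects):

    import json

    def iter_objects(chunks):
        """Yield complete top-level JSON objects from an iterable of text chunks."""
        decoder = json.JSONDecoder()
        buf = ""
        for chunk in chunks:
            buf += chunk
            while True:
                buf = buf.lstrip()
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)  # one complete value yet?
                except json.JSONDecodeError:
                    break                               # no, wait for more data
                yield obj
                buf = buf[end:]

    # usage: for record in iter_objects(iter(lambda: f.read(1 << 16), "")): ...

Memory stays bounded by the size of a single object rather than the whole stream, which is the point.
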

zjaffee•26m ago
It's not just about how much data you have, but also the sorts of things you are running on your data. The cost of joins and group-bys grows much faster than that of a plain aggregation. Additionally, you get a unified platform where large teams can share code in a structured way for all data processing jobs. In that sense it's similar to how companies use k8s as a way to manage the human side of software development.

I can, however, say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key ways customers saw rapid improvement was simply using fewer machines with vertically scaled hardware, which almost always outperformed any sort of distributed system (albeit not always from a price-performance perspective).

The real value often comes from the ability to do retries, leverage leftover underutilized hardware (i.e. spot instances, or your own data center at times when load is lower), handle hardware failures, etc., all while the full suite of tools above keeps working.

rented_mule•22m ago
I like the peer comment's answer about a processing time threshold (e.g., a day). Another obvious threshold is data that doesn't conveniently fit on local disks. Large-scale processing solutions can often read directly from, and write directly to, object stores like S3. And if the job runs inside the same provider (e.g., AWS in the case of S3), data can often be streamed much faster than from local SSDs. 10 GB/s has been available for a decade or more, and I think 100 GB/s is available these days.
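
A minimal sketch of that kind of direct-from-S3 streaming with boto3 (the bucket, key, and record layout are made up):

    import json
    import boto3  # pip install boto3

    # Stream an object straight from S3 and process it line by line,
    # without writing it to local disk first.
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="my-bucket", Key="logs/events.jsonl")["Body"]
    for line in body.iter_lines():
        record = json.loads(line)
        # ... process one record at a time ...
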
KolmogorovComp•13m ago
> Just how much data do you need before these sorts of clustered approaches really start to make sense?

I did not see your comment earlier, but to stay with chess, see https://news.ycombinator.com/item?id=46667287, with ~14 TB uncompressed.

It's not humongous and it can certainly fit on disk(s), but not on a typical laptop.

MarginalGainz•44m ago
The saddest part about this article being from 2014 is that the situation has arguably gotten worse.

We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.

I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.

petcat•21m ago
> a robust bash script

These hardly exist in practice.

But I get what you mean.

RobinL•3m ago
Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task.
rented_mule•35m ago
A little bit of history related to the article for any who might be interested...

mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.

Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.

I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
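
For anyone who never touched it: an mrjob job is just a Python class with mapper/reducer methods, and the same script runs locally or on a cluster depending on the -r flag. A skeleton from memory (a sketch, not the article's actual chess job):

    from mrjob.job import MRJob

    class MRLineCount(MRJob):
        # Mapper: emit a counter for every input line.
        def mapper(self, _, line):
            yield "lines", 1

        # Reducer: sum the counters for each key.
        def reducer(self, key, counts):
            yield key, sum(counts)

    if __name__ == "__main__":
        MRLineCount.run()

    # local mode:  python mr_line_count.py -r local input.txt
    # on EMR:      python mr_line_count.py -r emr s3://bucket/input/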

KolmogorovComp•16m ago
> The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.

It would be interesting to redo the benchmark but with a (much) larger database.

Nowadays the biggest open dataset for chess must come from Lichess (https://database.lichess.org), with ~7B games and 2.34 TB compressed, ~14 TB uncompressed.

Would Hadoop win here?

fmajid•13m ago
I've contributed to PrestoDB, but the availability of DuckDB and fast multi-core machines with even faster SSDs makes the need for distribution all the more niche, and often just cargo-culting Google or Meta.

Iconify: Library of Open Source Icons

https://icon-sets.iconify.design/
292•sea-gold•5h ago•31 comments

Consent-O-Matic

https://github.com/cavi-au/Consent-O-Matic
86•throawayonthe•3h ago•40 comments

ThinkNext Design

https://thinknextdesign.com/home.html
120•__patchbit__•6h ago•54 comments

Starting from scratch: Training a 30M Topological Transformer

https://www.tuned.org.uk/posts/013_the_topological_transformer_training_tauformer
11•tuned•1h ago•0 comments

Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
35•tosh•3h ago•26 comments

Profession by Isaac Asimov

https://www.abelard.org/asimov.php
99•bkudria•10h ago•14 comments

Keystone (YC S25) Is Hiring

1•pablo24602•43m ago

The longest Greek word

https://en.wikipedia.org/wiki/Lopado%C2%ADtemacho%C2%ADselacho%C2%ADgaleo%C2%ADkranio%C2%ADleipsa...
139•firloop•8h ago•61 comments

ASCII characters are not pixels: a deep dive into ASCII rendering

https://alexharri.com/blog/ascii-rendering
1039•alexharri•1d ago•119 comments

jQuery 4

https://blog.jquery.com/2026/01/17/jquery-4-0-0/
375•OuterVale•8h ago•113 comments

Show HN: GibRAM an in-memory ephemeral GraphRAG runtime for retrieval

https://github.com/gibram-io/gibram
27•ktyptorio•5h ago•4 comments

The recurring dream of replacing developers

https://www.caimito.net/en/blog/2025/12/07/the-recurring-dream-of-replacing-developers.html
486•glimshe•22h ago•384 comments

No knives, only cook knives

https://kellykozakandjoshdonald.substack.com/p/no-knives-only-cook-knives
74•firloop•13h ago•21 comments

Kip: A programming language based on grammatical cases of Turkish

https://github.com/kip-dili/kip
194•nhatcher•15h ago•59 comments

We put Claude Code in Rollercoaster Tycoon

https://labs.ramp.com/rct
470•iamwil•5d ago•265 comments

The grab list: how museums decide what to save in a disaster

https://www.economist.com/1843/2025/11/21/the-grab-list-how-museums-decide-what-to-save-in-a-disa...
20•surprisetalk•3d ago•2 comments

Five Practical Lessons for Serving Models with Triton Inference Server

https://talperry.com/en/posts/genai/triton-inference-server/
11•talolard•4d ago•1 comment

Raising money fucked me up

https://blog.yakkomajuri.com/blog/raising-money-fucked-me-up
283•yakkomajuri•18h ago•98 comments

Play chess via Slack DMs or SMS using an ASCII board

https://github.com/dvelton/dm-chess
13•dustfinger•6d ago•4 comments

If you put Apple icons in reverse it looks like someone getting good at design

https://mastodon.social/@heliographe_studio/115890819509545391
577•lateforwork•12h ago•221 comments

Erdos 281 solved with ChatGPT 5.2 Pro

https://twitter.com/neelsomani/status/2012695714187325745
210•nl•8h ago•176 comments

Xous Operating System

https://xous.dev/
142•eustoria•3d ago•53 comments

Building a better Bugbot

https://cursor.com/blog/building-bugbot
33•onurkanbkrc•2d ago•12 comments

Throwing it all away over the Mercator projection

https://danieldrezner.substack.com/p/what-is-trump-even-doing-at-this
12•jhide•1h ago•0 comments

The Olivetti Company

https://www.abortretry.fail/p/the-olivetti-company
193•rbanffy•6d ago•41 comments

Data Activation Thoughts

https://galsapir.github.io/sparse-thoughts/2026/01/17/data_activation/
9•galsapir•11h ago•2 comments

Computer Systems Security 6.566 / Spring 2024

https://css.csail.mit.edu/6.858/2024/
91•barishnamazov•12h ago•12 comments

An Elizabethan mansion's secrets for staying warm

https://www.bbc.com/future/article/20260116-an-elizabethan-mansions-secrets-for-staying-warm
160•Tachyooon•19h ago•159 comments

Claude Shannon's randomness-guessing machine

https://www.loper-os.org/bad-at-entropy/manmach.html
25•Kotlopou•5d ago•10 comments

Why Object of Arrays beat interleaved arrays: a JavaScript performance issue

https://www.royalbhati.com/posts/js-array-vs-typedarray
36•howToTestFE•1w ago•13 comments