650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark

https://dataengineeringcentral.substack.com/p/650gb-of-data-delta-lake-on-s3-polars
41•tanelpoder•2h ago

Comments

esafak•49m ago
If I understand correctly, Polars relies on delta-rs for Delta Lake support, and delta-rs is what does not support deletion vectors: https://github.com/delta-io/delta-rs/issues/1094

It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.

mynameisash•5m ago
> It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.

I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.

Like the time a junior engineer asked for help processing hundreds of ~5GB JSON files with a Python script that turned out to be doing crazy amounts of string concatenation (don't ask). It was taking something like 18 hours to run, IIRC. Writing a simple console tool to do the heavy lifting and letting Python's multiprocessing fan it out across the files dropped the time to something like 35 minutes.
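For anyone curious what that pattern looks like, here is a minimal sketch, not the actual setup from the story. The console tool and its arguments are not described above, so `TOOL_CMD` is a hypothetical stand-in that just counts lines in a file, and `run_tool` and `process_all` are illustrative names. It uses the thread pool from the `multiprocessing` package because the heavy work happens in the child processes; a regular `multiprocessing.Pool` would be driven the same way.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for the fast console tool: a one-liner that
# prints the number of lines in the file given as its first argument.
TOOL_CMD = [
    sys.executable,
    "-c",
    "import sys; print(sum(1 for _ in open(sys.argv[1], 'rb')))",
]


def run_tool(path: str) -> int:
    """Run the console tool on one file and parse its numeric output."""
    result = subprocess.run(
        TOOL_CMD + [path], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def process_all(paths, workers: int = 4):
    """Fan the files out across `workers` concurrent tool invocations."""
    with ThreadPool(workers) as pool:
        # map() preserves input order, so results line up with paths.
        return pool.map(run_tool, paths)
```

Calling `process_all(list_of_paths, workers=8)` returns one result per file, in the same order as the input paths. The worker count is a guess to tune against core count and disk throughput.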

Right tool for the right job, people.

blmarket•29m ago
Presto (the engine AWS Athena was built on) might be a faster/better alternative? I'd also like to see whether the 650GB of data is available locally.

andy99•18m ago
Awk? https://adamdrake.com/command-line-tools-can-be-235x-faster-...

co0lster•15m ago
650GB is the size of the Parquet files, which are compressed; in reality the data is way more than that.

32GB of Parquet cannot fit in 32GB of RAM.

luizfelberti•13m ago
Honestly this benchmark feels completely dominated by the instance's NIC capacity.

They used a c5.4xlarge, which has a 10Gbps NIC. At a constant 100% saturation it would take in the ballpark of 9 minutes to pull all of that data from S3, so that is your best-case scenario for reading the data (without even considering writing it back!)
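The back-of-the-envelope number above can be checked with a few lines. This is only a sketch of the arithmetic, assuming decimal gigabytes, a fully saturated link, and zero protocol overhead; `transfer_minutes` is an illustrative helper name.

```python
def transfer_minutes(size_gb: float, link_gbps: float) -> float:
    """Best-case minutes to move size_gb gigabytes over a link_gbps link."""
    gigabits = size_gb * 8           # 8 bits per byte
    seconds = gigabits / link_gbps   # assumes 100% link saturation
    return seconds / 60


print(f"{transfer_minutes(650, 10):.1f} minutes")  # 8.7 minutes
```

Any real run will be slower than this floor, since S3 request latency and uneven IO scheduling keep the link from staying saturated.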

Minute differences in how these query engines schedule IO would have drastic effects in the benchmark outcomes, and I doubt the query engine itself was constantly fed during this workload, especially when evaluating DuckDB and Polars.

The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.

jdnier•6m ago
DuckDB has a new "DuckLake" catalog format that would be another candidate to test. https://ducklake.select/
