
Show HN: 7x faster Iceberg ingestion, how we redesigned OLake's writer

https://olake.io/blog/how-olake-becomes-7x-faster/
3•rohankhameshra•56m ago
OLake is our open-source tool for ingesting database and Kafka data into Apache Iceberg. We recently redesigned the write pipeline and saw ~7x throughput improvements. Sharing the architecture decisions, trade-offs, and benchmarks.

Comments

rohankhameshra•52m ago
Hi everyone, I’m one of the founders of OLake. We’ve been working on a high-throughput, open-source ingestion path for Apache Iceberg, and I wanted to share the latest benchmark results and the architectural changes behind them. Here are the key numbers from the benchmark run:

- On a 4.01 billion-row dataset, OLake sustained around 319,562 rows/sec (full-load) from Postgres into Iceberg.

- The next-best ingestion tool we tested on the same dataset managed about 46,000 rows/sec, making OLake roughly 6.8× faster for full loads.

- For CDC workloads, OLake ingested change batches at around 41,390 rows/sec, compared to ~26,900 rows/sec for the closest alternative.

- Average memory usage was about 44 GB, peaking at ~59 GB on a 64-vCPU / 128 GB RAM VM.

- Parquet file output stabilized at ~300–400 MB per file (after compression), improving performance downstream and avoiding “small file” fragmentation.
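As a quick sanity check on the ratios, the arithmetic can be reproduced from the figures quoted above. This is only a sketch: the helper name `speedup` is made up for illustration, and the 46,000 rows/sec input is the rounded value from this post, so the result lands slightly off the quoted 6.8×, presumably because of rounding in the published figures.

```go
package main

import "fmt"

// speedup reports how many times faster throughput a is than throughput b.
// Both arguments are in rows/sec. (Hypothetical helper, for illustration only.)
func speedup(a, b float64) float64 {
	return a / b
}

func main() {
	// Figures as quoted in the list above; 46,000 is a rounded value.
	fmt.Printf("full load: %.2fx\n", speedup(319562, 46000))
	fmt.Printf("CDC:       %.2fx\n", speedup(41390, 26900))
}
```

With these inputs the program prints 6.95x for full load and 1.54x for CDC, so the full-load gain is far larger than the CDC gain.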

How we got these improvements:

1. Rewrote the writer architecture: Parsing, schema evolution, buffering, and batch management now happen in Go. Only the final Parquet and Iceberg write path uses Java. This removed a large amount of serialization overhead and JVM churn.

2. Introduced a new batching and buffering model: Instead of producing many small Parquet files, we buffer data in memory per thread and commit large chunks (roughly 4 GB before compression). This keeps throughput high and files uniform.

3. Optimized Iceberg metadata operations: Commits remain atomic even with large batches, and schema evolution happens fully in Go before any write, reducing cross-system coordination.

4. Improved operational stability: CPU, memory, and disk behaviour remained predictable even at multi-billion-row scales.
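The buffering model in point 2 can be illustrated with a minimal sketch. This is not OLake's actual code: `chunkBuffer`, `chunkSizes`, and the `flush` callback are hypothetical names, and the threshold here is a few bytes rather than the roughly 4 GB mentioned above. The idea shown is only that each worker owns a buffer, accumulates rows, and hands off one large chunk when a size threshold is reached.

```go
package main

import "fmt"

// chunkBuffer accumulates encoded rows for a single worker and passes them
// to flush once the buffered bytes reach threshold. All names are hypothetical.
type chunkBuffer struct {
	threshold int                        // flush once this many bytes are buffered
	size      int                        // bytes currently buffered
	rows      [][]byte                   // buffered rows
	flush     func(chunk [][]byte) error // stand-in for the Parquet/Iceberg write
}

// Add buffers one row and flushes if the threshold has been reached.
func (b *chunkBuffer) Add(row []byte) error {
	b.rows = append(b.rows, row)
	b.size += len(row)
	if b.size >= b.threshold {
		return b.Flush()
	}
	return nil
}

// Flush writes out whatever is buffered. On error the rows stay buffered
// so the caller can retry.
func (b *chunkBuffer) Flush() error {
	if len(b.rows) == 0 {
		return nil
	}
	if err := b.flush(b.rows); err != nil {
		return err
	}
	b.rows, b.size = nil, 0
	return nil
}

// chunkSizes feeds rows through a buffer and returns the row count of each
// flushed chunk, including the final partial one.
func chunkSizes(threshold int, rows [][]byte) []int {
	var sizes []int
	buf := &chunkBuffer{threshold: threshold, flush: func(chunk [][]byte) error {
		sizes = append(sizes, len(chunk))
		return nil
	}}
	for _, r := range rows {
		_ = buf.Add(r) // flush never fails in this demo
	}
	_ = buf.Flush() // write the remainder at end of input
	return sizes
}

func main() {
	rows := make([][]byte, 7)
	for i := range rows {
		rows[i] = []byte("abcd") // 4 bytes per row
	}
	fmt.Println(chunkSizes(10, rows)) // prints [3 3 1]
}
```

Because each worker owns its buffer, no locking is needed in this sketch. The trade-off is memory: buffered bytes scale with the number of workers times the threshold.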

Benchmark setup:

- Dataset: ~4.01 billion rows from the NYC Taxi + FHV trips Parquet datasets (row width ~120–144 bytes).

- Test machine: Azure Standard D64ls v5 (64 vCPUs, 128 GB RAM).

- Storage: Iceberg tables were stored on local NVMe for the benchmark; the same architecture works with S3/GCS/HDFS.
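For a rough sense of scale, the setup figures imply a run length and raw data volume that can be estimated as below. This is back-of-the-envelope arithmetic only: it assumes the quoted full-load rate of 319,562 rows/sec held for the whole run and uses the row-width range given above. The helper names `loadHours` and `rawGB` are made up for illustration, and the linked benchmark page remains the authoritative source.

```go
package main

import "fmt"

// loadHours estimates wall-clock hours needed to ingest rows at a
// sustained rate of rowsPerSec. (Hypothetical helper.)
func loadHours(rows, rowsPerSec float64) float64 {
	return rows / rowsPerSec / 3600
}

// rawGB estimates the uncompressed data size in GB (10^9 bytes) for
// rows of a fixed width. (Hypothetical helper.)
func rawGB(rows, bytesPerRow float64) float64 {
	return rows * bytesPerRow / 1e9
}

func main() {
	const rows = 4.01e9 // dataset size quoted above
	fmt.Printf("full load: about %.2f hours\n", loadHours(rows, 319562))
	fmt.Printf("raw size:  %.0f GB to %.0f GB\n", rawGB(rows, 120), rawGB(rows, 144))
}
```

Under these assumptions the full load takes about 3.49 hours and moves roughly 481 GB to 577 GB of uncompressed row data.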

The full benchmark results, methodology, and configs are here: https://olake.io/docs/benchmarks/

And the deep-dive into how we got the ~7× speedup is here: https://olake.io/blog/how-olake-becomes-7x-faster/

I’d love feedback from the HN community, specifically around tuning (batch sizes, commit frequency, partitioning strategies), Iceberg best practices, and real-world constraints you’ve seen in high-volume pipelines.

Happy to answer questions or share configs. Thanks for taking a look!
