OLake is our open-source tool for ingesting database and Kafka data into Apache Iceberg. We recently redesigned the write pipeline and saw a ~7x throughput improvement. Sharing the architecture decisions, trade-offs, and benchmarks.
Comments
rohankhameshra•52m ago
Hi everyone, I’m one of the founders of OLake. We’ve been working on a high-throughput, open-source ingestion path for Apache Iceberg, and I wanted to share the latest benchmark results and the architectural changes behind them.
Here are the key numbers from the benchmark run:
- On a 4.01 billion-row dataset, OLake sustained around 319,562 rows/sec (full-load) from Postgres into Iceberg.
- The next-best ingestion tool we tested on the same dataset managed about 46,000 rows/sec, making OLake roughly 6.8× faster for full loads.
- For CDC workloads, OLake ingested change batches at around 41,390 rows/sec, compared to ~26,900 rows/sec for the closest alternative.
- Average memory usage was about 44 GB, peaking at ~59 GB on a 64-vCPU / 128 GB RAM VM.
- Parquet file output stabilized at ~300–400 MB per file (after compression), improving performance downstream and avoiding “small file” fragmentation.
How we got these improvements:
1. Rewrote the writer architecture:
Parsing, schema evolution, buffering, and batch management now happen in Go. Only the final Parquet and Iceberg write path uses Java. This cut out a large amount of cross-language serialization overhead and JVM churn.
2. Introduced a new batching and buffering model:
Instead of producing many small Parquet files, we buffer data in memory per thread and commit large chunks (roughly 4 GB before compression). This keeps throughput high and file sizes uniform (a minimal sketch of the idea follows this list).
3. Optimized Iceberg metadata operations:
Commits remain atomic even with large batches, and schema evolution is resolved fully in Go before any write, reducing cross-system coordination (also sketched after this list).
4. Improved operational stability:
CPU, memory, and disk behaviour remained predictable even at multi-billion-row scales.
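To make the batching model (point 2) concrete, here's a minimal Go sketch of the idea rather than our actual writer code; `batchBuffer`, `flushFn`, and the threshold constant are illustrative names, and the real pipeline tracks more state:

```go
package main

import "fmt"

// batchBuffer accumulates encoded rows for a single worker and flushes them
// to the Parquet/Iceberg writer as one large chunk once the uncompressed
// size crosses a threshold. Illustrative sketch only, not OLake's code.
type batchBuffer struct {
	rows      [][]byte
	sizeBytes int64
	threshold int64                     // e.g. ~4 GiB before compression
	flushFn   func(rows [][]byte) error // hands the chunk to the writer (hypothetical)
}

func (b *batchBuffer) Add(row []byte) error {
	b.rows = append(b.rows, row)
	b.sizeBytes += int64(len(row))
	if b.sizeBytes >= b.threshold {
		return b.Flush()
	}
	return nil
}

func (b *batchBuffer) Flush() error {
	if len(b.rows) == 0 {
		return nil
	}
	if err := b.flushFn(b.rows); err != nil {
		return err
	}
	// Reset for the next chunk: large, uniform chunks are what keep the
	// resulting Parquet files in the ~300-400 MB range after compression.
	b.rows = b.rows[:0]
	b.sizeBytes = 0
	return nil
}

func main() {
	buf := &batchBuffer{
		threshold: 4 << 30, // ~4 GiB, matching the chunk size described above
		flushFn: func(rows [][]byte) error {
			fmt.Printf("flushing %d rows\n", len(rows))
			return nil
		},
	}
	_ = buf.Add([]byte("example-row"))
	_ = buf.Flush() // final flush at end of stream
}
```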
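And a similarly simplified sketch of point 3, resolving schema changes before any write: diff the incoming batch's fields against the current table schema and apply the missing columns to the Iceberg table first, so the write itself never has to coordinate a schema change mid-commit. The `field` type here is illustrative, not our internal representation:

```go
package main

import "fmt"

// field is a simplified column descriptor; Iceberg tracks more (field ids,
// required/optional, nested types), but this is enough to show the idea.
type field struct {
	Name string
	Type string
}

// missingColumns returns the incoming fields that the current table schema
// does not have yet. These additions are applied to the table metadata
// before the batch is written.
func missingColumns(table, incoming []field) []field {
	have := make(map[string]bool, len(table))
	for _, f := range table {
		have[f.Name] = true
	}
	var missing []field
	for _, f := range incoming {
		if !have[f.Name] {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	table := []field{{"trip_id", "long"}, {"pickup_ts", "timestamp"}}
	incoming := []field{{"trip_id", "long"}, {"pickup_ts", "timestamp"}, {"congestion_fee", "double"}}
	fmt.Println(missingColumns(table, incoming)) // -> [{congestion_fee double}]
}
```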
Benchmark setup:
- Dataset: ~4.01 billion rows from the NYC Taxi + FHV trips Parquet sets (row width ~120–144 bytes).
- Test machine: Azure Standard D64ls v5 (64 vCPUs, 128 GB RAM).
- Storage: Iceberg tables written to local NVMe for the benchmark; the same architecture works with S3, GCS, and HDFS.
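For a rough sense of scale, here's a back-of-envelope calculation from the figures above (the 132-byte row width is just the midpoint of the quoted 120–144 byte range, so treat it as an assumption): the dataset works out to roughly 530 GB of raw row data, and the full load at ~319.5K rows/sec finishes in about 3.5 hours.

```go
package main

import "fmt"

func main() {
	// Back-of-envelope figures taken from the benchmark numbers above.
	const (
		rows       = 4.01e9   // total rows in the dataset
		rowBytes   = 132.0    // midpoint of the quoted 120-144 byte row width (assumption)
		rowsPerSec = 319562.0 // sustained full-load throughput
	)

	totalGB := rows * rowBytes / 1e9
	hours := rows / rowsPerSec / 3600

	fmt.Printf("~%.0f GB of raw row data, full load in ~%.1f hours\n", totalGB, hours)
	// Prints: ~529 GB of raw row data, full load in ~3.5 hours
}
```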
I’d love feedback from the HN community, specifically around tuning (batch sizes, commit frequency, partitioning strategies), Iceberg best practices, and real-world constraints you’ve seen in high-volume pipelines.
Happy to answer questions or share configs. Thanks for taking a look!
rohankhameshra•52m ago
The full benchmark results, methodology, and configs are here: https://olake.io/docs/benchmarks/
And the deep-dive into how we got the ~7× speedup is here: https://olake.io/blog/how-olake-becomes-7x-faster/