The Equality Delete Problem in Apache Iceberg

https://blog.dataengineerthings.org/the-equality-delete-problem-in-apache-iceberg-143dd451a974
42•dkgs•7h ago

Comments

datadrivenangel•6h ago
Change Data Capture is hard if you fall off the happy path, and data lakes won't save you.
UltraSane•1h ago
Could you explain what this means in more detail?
sakesun•1h ago
This is very obvious. Synchronization based on a sequential change log is brittle by nature.
UltraSane•1h ago
What are the most common failure modes?

Whenever I implement CDC I always try to implement some kind of integrity check that runs at some reasonable interval that can detect and try to fix any discrepancies. CDC is by far the most efficient form of data synchronization in most situations.
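A periodic check like the one described above could be sketched roughly as follows. This is a minimal illustration, not anyone's production code: the helper names (`row_digest`, `find_discrepancies`) and the `id` key column are assumptions, and both sides are modeled as in-memory lists of dicts rather than real tables.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Stable hash of a row's column values (hypothetical helper)."""
    # Sort by column name so dict ordering cannot change the digest.
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_discrepancies(source_rows, replica_rows, key="id"):
    """Compare source and replica by primary key and per-row digest."""
    src = {r[key]: row_digest(r) for r in source_rows}
    dst = {r[key]: row_digest(r) for r in replica_rows}
    return {
        # Rows the replica never received (lost inserts).
        "missing": sorted(src.keys() - dst.keys()),
        # Rows the replica still has but the source dropped (lost deletes).
        "extra": sorted(dst.keys() - src.keys()),
        # Rows present on both sides with different contents (lost updates).
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

On real tables one would likely hash ranges of keys rather than every row, so that only the ranges whose digests differ need a row-level comparison.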

sakesun•1h ago
You are doing it right by having a periodic check. Why would you want to know what the common failure modes are, then?
dkdcio•5h ago
> Databricks recently spent $1 billion to acquire Neon, a startup building a serverless Postgres. Snowflake also spent about $250 million to acquire Crunchy Data, a veteran enterprise-grade Postgres provider.

It's kinda funny to not mention that Databricks acquired Tabular, the Iceberg company, for a billion dollars: https://www.databricks.com/company/newsroom/press-releases/d...

kwillets•5h ago
Another chapter of the slowly-reimplementing-Vertica saga.

It's becoming clear that merge trees and compaction need to be addressed next, after delete vectors brought them onstage.

Vertica will actually look up the equality keys in a relevant projection if it exists, and then use the column values in the matching rows to equality-delete from the other projections; it's fairly good at avoiding table scans.

datadrivenangel•1h ago
Data processing tools had a pricing problem. The Big Data Revolution was Google and other companies realizing that commodity hardware had gotten so good that you could throw 100x as much compute at a processing job and it would still be cheaper than Oracle, Vertica, Teradata, and SQL Server.

As an industry, we keep forgetting these things and reinventing the wheel because there is more money to be made squeezing enterprises than providing widely available sustainable software for a fair price and then losing mindshare to the next generation of tools and eventually getting sold for parts. It's a sad dynamic.

ajd555•4h ago
> Postgres and Apache Iceberg are both mature systems

Apache Iceberg as mature? I mean, there's a lot of activity around it, but I remember a year ago the Rust library didn't even have write capabilities. And it's not like the library is a client and there's an Iceberg server - the library literally is the whole product, interacting with the files in S3

ajd555•4h ago
I suppose, in fairness, the Java library has been around for much longer
jsight•4h ago
A lot of people will spend dozens of hours and tens of thousands of their company's money to avoid learning Java.

I'm not even sure if I'm joking. :)

kwillets•3h ago
This is data engineering, where people spend thousands of dollars of their company's money to avoid learning SQL. The place with no Java is across the street (old Soviet joke, originally for meat/fish stores).
icedchai•3h ago
Sad but true. Or they learn "something" about SQL but not about indexes, data types, joins, or even aggregate functions. I've seen some python horror shows that would select * entire tables into lists of dicts, only to do the equivalent of a where clause and a couple of sums.
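To make the contrast concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and both function names are made up for illustration; the point is only that the second function lets the database do the filtering and summing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("eu", 10), ("us", 25), ("eu", 5), ("us", 40)],
)


def total_in_python(conn, region):
    """The 'horror show': pull the whole table, then filter and sum in Python."""
    cur = conn.execute("SELECT * FROM orders")
    cols = [d[0] for d in cur.description]
    rows = [dict(zip(cols, r)) for r in cur.fetchall()]  # list of dicts
    return sum(r["amount"] for r in rows if r["region"] == region)


def total_in_sql(conn, region):
    """Same result, but only one number crosses the wire."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?",
        (region,),
    ).fetchone()
    return total
```

Both return the same totals on this toy table; the difference only shows up once the table is large enough that moving every row into the client starts to hurt.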
jsight•1h ago
Ouch. I was really hoping this wasn't so common, but I guess it is. I'm sure they heard somewhere that "joins are expensive" or something along those lines, so they solved it!
pat2man•3h ago
I mean RisingWave, the solution mentioned in the article, is a complete startup rewriting things in Rust mostly to avoid the larger Java solutions like Flink and Spark...
datadrivenangel•1h ago
Flink and Spark are painful for streaming.

They do work, but they have some sharp and rough edges.

nxm•2h ago
Most tools in the data ecosystem support Iceberg as a first-class citizen, including major analytical query engines. Lots of development in this space has happened over the past year in particular.
amluto•4h ago
I don't really get it. If I'm understanding correctly, the goal of these CDC-to-Iceberg systems is to mirror, in near real-time, a Postgres table into an Iceberg database. The article states, repeatedly:

> In streaming CDC scenarios, however, you’d need to query Iceberg for the location on every delete: introducing random reads, latency, and drastically lowering throughput under high concurrency. On large tables, real-time performance is essentially impossible.

Let's consider the actual situation. There's a Postgres table that fits on whatever Postgres server is in use. It gets mirrored to Iceberg. Postgres is a full-fledged relational database and has indexes and such. Iceberg is not, although it can be scanned much faster than Postgres and queried by fancy Big Data tools (which, I agree, are really cool!). And, notably, there is no index mapping Postgres rows to Iceberg row positions.

But why isn't there? CDC is inherently stateful -- unless someone is going to build Merkle trees or similar to allow efficiently diffing table states (which would be awesome), the CDC process needs to keep enough state to know where it is. Maybe this is O(1) in current implementations. But why not keep the entire mapping from Postgres rows to Iceberg positions? The Postgres database table is about N rows times however wide a row is, and it fits on a Postgres server. The mapping needed would be about the size of a single index on the table. Why not store it somewhere? Updates to it will be faster than updates to the source Postgres table, so it will keep up. Is the problem that this is awkward to do in a "serverless" manner?

For extra fun, someone could rig up Postgres (via an extension or just some clever tables) so that the mapping is stored in Postgres itself. It would be, roughly, one small table with CDC state and one moderate size table per mirrored table storing the row position mapping. It could be on the same server instance or a different one.
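The mapping idea above might look something like the following sketch. Everything here is hypothetical: `PositionIndex` and its methods are not part of Iceberg or any CDC tool, the index lives in a plain dict rather than in Postgres, and file names stand in for real data file paths. It only shows how a key-to-position lookup lets a delete be written as a position delete without scanning the table.

```python
from dataclasses import dataclass, field


@dataclass
class PositionIndex:
    """Hypothetical side index: primary key -> (data file, row position)."""

    locations: dict = field(default_factory=dict)
    position_deletes: list = field(default_factory=list)

    def on_insert(self, pk, data_file, pos):
        # Remember where the writer placed this row.
        self.locations[pk] = (data_file, pos)

    def on_delete(self, pk):
        # Look up the row's location and record a position delete for it.
        loc = self.locations.pop(pk, None)
        if loc is not None:
            self.position_deletes.append(loc)
        return loc

    def on_update(self, pk, data_file, pos):
        # An update is a delete of the old location plus an insert of the new one.
        self.on_delete(pk)
        self.on_insert(pk, data_file, pos)
```

The sketch ignores compaction: any job that rewrites data files would also have to refresh the stored positions, which is probably where most of the real complexity would be.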

slt2021•3h ago
this use case of postgres + CDC + iceberg feels like the wrong architecture.

postgres is for relational data, ok

CDC is meant to capture changes and process the changes only (in isolation from all previous changes), not to recover the snapshot of the original table by reimplementing the logic inside postgres of merge-on-read

iceberg is columnar storage for large historical data for analytics, it's not meant for relational data, and certainly not for realtime

it looks like they need to use a time-series oriented db, like timescale, influxdb, etc

nxm•2h ago
The goal is data replication into the data lake, and not in real-time. CDC is just a means to an end.
datadrivenangel•1h ago
A table file format like Iceberg is a relational tabular format.

With a query engine that supports federation, we can write SELECT * FROM PG_Table or SELECT * FROM Iceberg_File just the same.
