frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
163•theblazehen•2d ago•47 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
674•klaussilveira•14h ago•202 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
950•xnx•20h ago•552 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
123•matheusalmeida•2d ago•33 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
22•kaonwarb•3d ago•19 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
58•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
232•isitcontent•14h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
225•dmpetrov•15h ago•118 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
332•vecti•16h ago•144 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
495•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
383•ostacke•20h ago•95 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
360•aktau•21h ago•182 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
289•eljojo•17h ago•175 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
413•lstoll•21h ago•279 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
32•jesperordrup•4h ago•16 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
20•bikenaga•3d ago•8 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
17•speckx•3d ago•7 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
63•kmm•5d ago•7 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
91•quibono•4d ago•21 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
258•i5heu•17h ago•196 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
32•romes•4d ago•3 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
44•helloplanets•4d ago•42 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
60•gfortaine•12h ago•26 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1070•cdrnsf•1d ago•446 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
36•gmays•9h ago•12 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•70 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
288•surprisetalk•3d ago•43 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
150•SerCe•10h ago•142 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
186•limoce•3d ago•100 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
73•phreda4•14h ago•14 comments

The equality delete problem in Apache Iceberg

https://blog.dataengineerthings.org/the-equality-delete-problem-in-apache-iceberg-143dd451a974
65•dkgs•5mo ago

Comments

datadrivenangel•5mo ago
Change Data Capture is hard if you fall off the happy path, and data lakes won't save you.
UltraSane•5mo ago
Could you explain what this means in more detail?
sakesun•5mo ago
This is very obvious. Synchronization based on a sequential change log is brittle by nature.
UltraSane•5mo ago
What are the most common failure modes?

Whenever I implement CDC I always try to add some kind of integrity check that runs at a reasonable interval and can detect and try to fix any discrepancies. CDC is by far the most efficient form of data synchronization in most situations.
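
A minimal sketch of the kind of periodic check being described, assuming two DB-API style cursors over the same logical table (the function and probe queries are illustrative, not from the article):

    # Periodic reconciliation between a CDC source and its replica.
    # Cheap invariants first; a full checksum/backfill can follow if they drift.
    def reconcile(source_cur, target_cur, table, key_col):
        probes = [
            f"SELECT COUNT(*) FROM {table}",
            f"SELECT MIN({key_col}), MAX({key_col}) FROM {table}",
        ]
        drift = []
        for sql in probes:
            source_cur.execute(sql)
            target_cur.execute(sql)
            if source_cur.fetchone() != target_cur.fetchone():
                drift.append(sql)
        return drift  # non-empty means the two sides have diverged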

sakesun•5mo ago
You're doing it right by having a periodic check. Why would you want to know the common failure modes, then?
phanimahesh•5mo ago
Intellectual curiosity and a desire to learn something, to exchange ideas and learn if there's something they haven't thought of, etc?
UltraSane•5mo ago
Not having an error-detection method is like running a stepper motor with open-loop control. Not something any good engineer will accept.
halfcat•5mo ago
The point is, you can’t do synchronization reliably using any message/event-based approach alone. You always need a reconciliation mechanism, as you’ve correctly noticed.

It’s always shocking to me how many FAANG people will say, “we want an event-driven solution with a message bus, that’s the right way to do it, we don’t want batch, that can’t scale”, and then need to bolt on a validation/reconciliation step for it to be reliable. Which of course is a batch job.

Unless you control a system end to end (which is rare, there’s usually some data from a system you don’t control the schema of), or are highly incentivized to make sync happen reliably (e.g. bitcoin), you’re always limited by a batch job somewhere in the system.

You could even say the batch job is often doing the heavy lifting, and the message bus is just an optimization (that’s often not adding the value needed to justify the complexity).

UltraSane•5mo ago
Using something like Kafka you can get reliable at-least-once messaging, and then you just need to make the CDC updates idempotent.

I'm not sure exactly what you mean by batch job, but if n bytes change at the source, you shouldn't have to copy more than n bytes to the destination.
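
One common way to get that idempotence is to key every change on the source primary key, so that replaying a Kafka message converges to the same row state. A sketch with made-up table and columns, shown against a Postgres-style target (engines without ON CONFLICT would use MERGE):

    # Idempotent apply of one CDC event; `cur` is a DB-API cursor.
    # Re-delivering the same event under at-least-once semantics is harmless.
    def apply_change(cur, event):
        if event["op"] == "delete":
            cur.execute("DELETE FROM customers WHERE id = %s", (event["id"],))
        else:  # inserts and updates look the same to the target
            cur.execute(
                """
                INSERT INTO customers (id, name, updated_at)
                VALUES (%(id)s, %(name)s, %(updated_at)s)
                ON CONFLICT (id) DO UPDATE
                SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at
                WHERE customers.updated_at < EXCLUDED.updated_at  -- ignore stale replays
                """,
                event,
            )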

halfcat•5mo ago
If you can get all of your data efficiently (and transactionally consistent) into Kafka, that’s the scenario I mention where you have control of all your systems.

Inevitably, even if you achieve this at some point, it never lasts. Your company acquires another, or someone pushes for a different HR/CRM/whatever system and gets it.

You mention if n bytes change in the source, but many systems have no mechanism of determining that n bytes have changed without scanning the entire data set. So we’re back to a batch job (cron, or similar).

UltraSane•5mo ago
"many systems have no mechanism of determining that n bytes have changed without scanning the entire data set."

This is so insanely inefficient it can't scale to very large amounts of data. If you can't do data syncing at the application layer, you can do it at the storage layer with high-end storage arrays that duplicate all writes to a second storage array, either synchronously or asynchronously, or that replicate snapshots to a remote array. They work really well.

datadrivenangel•5mo ago
Integrity is the hard part. Usually hard deletes in the source system are hard to propagate well because full deletion is awkward when you only expect to *update* data.
dkdcio•5mo ago
> Databricks recently spent $1 billion to acquire Neon, a startup building a serverless Postgres. Snowflake also spent about $250 million to acquire Crunchy Data, a veteran enterprise-grade Postgres provider.

It's kinda funny to not mention that Databricks acquired Tabular, the Iceberg company, for a billion dollars: https://www.databricks.com/company/newsroom/press-releases/d...

kwillets•5mo ago
Another chapter of the slowly-reimplementing-Vertica saga.

It's becoming clear that merge trees and compaction need to be addressed next, after delete vectors brought them onstage.

Vertica will actually look up the equality keys in a relevant projection if it exists, and then use the column values in the matching rows to equality-delete from the other projections; it's fairly good at avoiding table scans.

datadrivenangel•5mo ago
Data processing tools had a pricing problem. The Big Data Revolution was google and other companies realizing that commodity hardware had gotten so good that you could throw 100x as much compute at a processing job and it would still be cheaper than Oracle, Vertica, Teradata, and SQL Server.

As an industry, we keep forgetting these things and reinventing the wheel, because there is more money to be made squeezing enterprises than in providing widely available, sustainable software for a fair price, only to lose mindshare to the next generation of tools and eventually get sold for parts. It's a sad dynamic.

kwillets•5mo ago
Vertica runs on commodity hardware, and there's no license cost for cpu, so it's very economical for large workloads. The quickest way to 10x your costs is to move a Vertica workload to Snowflake (last I heard my old job is now up to 40x). I'll have numbers on Databricks in a few months.

It's true that Vertica sales are optimized for large enterprises -- they just don't have the VC cash to hire 3000 sales people to sell it to the low end, so it doesn't appear on many people's radar.

ajd555•5mo ago
> Postgres and Apache Iceberg are both mature systems

Apache Iceberg as mature? I mean, there's a lot of activity around it, but I remember a year ago the Rust library didn't even have write capabilities. And it's not like the library is a client and there's an Iceberg server - the library literally is the whole product, interacting with the files in S3.

ajd555•5mo ago
I suppose, in fairness, the Java library has been around for much longer
_ea1k•5mo ago
A lot of people will spend dozens of hours and tens of thousands of dollars of their company's money to avoid learning Java.

I'm not even sure if I'm joking. :)

kwillets•5mo ago
This is data engineering, where people spend thousands of dollars of their company's money to avoid learning SQL. The place with no Java is across the street (old Soviet joke, originally for meat/fish stores).
icedchai•5mo ago
Sad but true. Or they learn "something" about SQL but not about indexes, data types, joins, or even aggregate functions. I've seen some Python horror shows that would SELECT * entire tables into lists of dicts, only to do the equivalent of a WHERE clause and a couple of sums.
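
A caricature of that pattern next to what it replaces (a sketch with made-up table and column names; `cur` is any DB-API cursor):

    # The horror-show version: drag the whole table across the wire,
    # then redo WHERE and SUM by hand in Python.
    cur.execute("SELECT * FROM orders")
    cols = [c[0] for c in cur.description]
    rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    eu_total = sum(r["amount"] for r in rows if r["region"] == "EU")

    # The same answer, computed where the data lives.
    cur.execute("SELECT SUM(amount) FROM orders WHERE region = %s", ("EU",))
    eu_total = cur.fetchone()[0]
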
_ea1k•5mo ago
Ouch. I was really hoping this wasn't so common, but I guess it is. I'm sure they heard somewhere that "joins are expensive" or something along those lines, so they solved it!
solid_fuel•5mo ago
Prisma built a whole ORM around that, complete with the client-side joins.
icedchai•5mo ago
There are also other "let's pretend we have big data" patterns I've seen. Running something like AWS Athena or Hadoop over and over, for less than a couple gigs of data that would fit into memory on a laptop from 15 years ago.
joseda-hg•5mo ago
Classic, I remember seeing something like this

SELECT * FROM X, into a C# list, Filter it with LINQ, and then use C# to do calculations

"Why not EF?" "Someone told me it was slower"

pat2man•5mo ago
I mean, RisingWave, the solution mentioned in the article, is an entire startup rewriting things in Rust, mostly to avoid the larger Java solutions like Flink and Spark...
datadrivenangel•5mo ago
Flink and Spark are painful for streaming.

They do work, but they have some sharp and rough edges.

fifilura•5mo ago
I think 90% of the use cases for streaming data exist because managers have nothing better to do than reload their profit dashboards.

They are rarely supported by real business value.

Save the data as soon as it comes in and transform it in batch mode.

nxm•5mo ago
Most tools in the data ecosystem support Iceberg as a first-class citizen, including the major analytical query engines. Lots of development in this space has happened over the past year in particular.
amluto•5mo ago
I don't really get it. If I'm understanding correctly, the goal of these CDC-to-Iceberg systems is to mirror, in near real-time, a Postgres table into an Iceberg database. The article states, repeatedly:

> In streaming CDC scenarios, however, you’d need to query Iceberg for the location on every delete: introducing random reads, latency, and drastically lowering throughput under high concurrency. On large tables, real-time performance is essentially impossible.

Let's consider the actual situation. There's a Postgres table that fits on whatever Postgres server is in use. It gets mirrored to Iceberg. Postgres is a full-fledged relational database and has indexes and such. Iceberg is not, although it can be scanned much faster than Postgres and queried by fancy Big Data tools (which, I agree, are really cool!). And, notably, there is no index mapping Postgres rows to Iceberg row positions.

But why isn't there? CDC is inherently stateful -- unless someone is going to build Merkle trees or similar to allow efficiently diffing table states (which would be awesome), the CDC process needs to keep enough state to know where it is. Maybe this is O(1) in current implementations. But why not keep the entire mapping from Postgres rows to Iceberg positions? The Postgres database table is about N rows times however wide a row is, and it fits on a Postgres server. The mapping needed would be about the size of a single index on the table. Why not store it somewhere? Updates to it will be faster than updates to the source Postgres table, so it will keep up. Is the problem that this is awkward to do in a "serverless" manner?

For extra fun, someone could rig up Postgres (via an extension or just some clever tables) so that the mapping is stored in Postgres itself. It would be, roughly, one small table with CDC state and one moderate size table per mirrored table storing the row position mapping. It could be on the same server instance or a different one.
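
A rough sketch of the bookkeeping the parent comment is proposing, kept in Postgres itself (table names, columns, and the lookup are purely illustrative; `cur` is a DB-API cursor on Postgres):

    # One small table for CDC progress, plus one mapping table per mirrored
    # table that records where each row currently lives in Iceberg.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS cdc_state (
            table_name text PRIMARY KEY,
            last_lsn   pg_lsn NOT NULL        -- how far the stream has been applied
        );
        CREATE TABLE IF NOT EXISTS iceberg_rowmap_orders (
            pk        bigint PRIMARY KEY,     -- source primary key
            data_file text   NOT NULL,        -- Iceberg data file holding the row
            row_pos   bigint NOT NULL         -- position within that file
        );
    """)

    # On an update/delete of a given key, look up the old row's location and
    # emit a positional delete instead of an equality delete.
    cur.execute("SELECT data_file, row_pos FROM iceberg_rowmap_orders WHERE pk = %s", (42,))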

oconnore•5mo ago
This is a lot of fuss when you can get a batch update to stay within a few minutes of latency. You only have this problem if you are very insistent on both (1) very near real-time, and (2) Iceberg. And you can't go down this path if you require transactional queries.

I think most people who need very near real-time queries also tend to need them to be transactional. The use case where you can accept inconsistent reads but something will break if you're 3 minutes out of date is very rare.

amluto•5mo ago
What do you mean “transactional”? Do you mean that a reader never sees a state in Iceberg that is not consistent with a state that could have been seen in a successful transaction in Postgres? If so, that seems fairly straightforward: have the CDC process read from Postgres in a transaction and write to Iceberg in a transaction. (And never do anything in the Postgres transaction that could cause it to fail to commit.)

But the 3 minute thing seems somewhat immaterial to me. If I have a table with one billion rows, and I do an every-three-minute batch job that needs to sync an average of one modified row to Iceberg, that job still needs to write the correct deletion record to Iceberg. If there's no index, then either the job writes a delete-by-key or the job needs to scan 1B Iceberg rows. Sure, that's doable in 3 minutes, but it's far from free.

amluto•5mo ago
> This is a lot of fuss when you can get a batch update to stay within a few minutes of latency.

Replying again to add: cost. Just because you can do a batch update every few minutes by doing a full scan of the primary key column of your Iceberg table and joining against your list of modified or deleted primary keys does not mean you should. That table scan costs actual money if the Iceberg table is hosted somewhere like AWS or uses a provider like Databricks, and running a full column scan every three minutes could be quite pricey.

slt2021•5mo ago
this use case of postgres + CDC + iceberg feels like the wrong architecture.

postgres is for relational data, ok

CDC is meant to capture changes and process only those changes (in isolation from all previous changes), not to recover a snapshot of the original table by reimplementing, via merge-on-read, the logic postgres already does internally

iceberg is columnar storage for large historical data for analytics; it's not meant for relational data, and certainly not for real-time

it looks like they need to use a time-series-oriented db, like timescale, influxdb, etc

nxm•5mo ago
The goal is data replication into the data lake, and not in real time. CDC is just a means to an end.
slt2021•5mo ago
one problem with replicating mutation changes to the data lake is that it may lose historical data.

let's say an e-shop customer changes his home address from WA to CA. if you replicate the new address and delete the old one during compaction, all past transactions may now be associated with the new address, which can lead to wrong conclusions or distortions in historical reports (last month's WA sales would now show up as CA sales)
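
One way to avoid that distortion, sketched here with made-up tables, is to land the address changes as an append-only history and join each sale to the address that was valid at the time of sale, rather than to the compacted latest row:

    # Point-in-time join: each sale is attributed to the address record
    # whose validity window covers the sale timestamp.
    cur.execute("""
        SELECT s.sale_id, s.amount, a.state
        FROM sales s
        JOIN customer_address_history a
          ON a.customer_id = s.customer_id
         AND s.sold_at >= a.valid_from
         AND s.sold_at <  a.valid_to   -- open-ended for the current address
    """)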

datadrivenangel•5mo ago
A table file format like Iceberg is a relational tabular format.

With a query engine that supports federation, we can write SELECT * FROM PG_Table or SELECT * FROM Iceberg_File just the same.

salmonellaeater•5mo ago
It's the wrong architecture from a dependency management perspective. Directly importing a table into Iceberg allows analytics consumers to take dependencies on it. This means the Postgres database schema can't be changed without breaking those consumers. This is effectively a multi-tenant database with extra steps.

This is not to say that this architecture isn't salvageable - if the only consumer of the Iceberg table copy is, e.g., a view that downstream consumers must use, then it's easier to change the Postgres schema, as only the view must be adjusted. My experience with copying tables directly to a data warehouse using CDC, though, suggests it's hard to prevent erosion of the architecture as high-urgency projects start taking direct dependencies to save time.

code_biologist•5mo ago
Eh, as long as it isn't life or death, I think allowing direct consumption and explicitly agreeing that breakage is a consumer problem is better for most business use cases (less code, easier to maintain and evolve). If you make a breaking schema change and nobody complains, is it really breaking?

I have spent way too much life maintaining consumer shield views and answering hairy schema translation questions for use cases so unimportant the downstream business user forgot they even had the view.

Important downstream data consumers almost always have monitoring/alerting set up (if it's not important enough to have those, it's not important) and usually the business user cares about integrity enough to help data teams set up CI. Even in these cases, where the business user cares a lot, I've still found shield views to be of limited utility versus just letting the schema change hit the downstream system and letting them handle it as they see fit, as long as they're prepared for it.

> it's hard to prevent erosion of the architecture as high-urgency projects start taking direct dependencies to save time.

IME, it feels wrong, but it mostly does end up saving time with few consequences. Worse is better.

rubenvanwyk•5mo ago
Reads a bit like a puff piece, but can’t deny the facts: RisingWave is an amazing product. I wonder when the industry is going to realise that in event-driven architectures you can actually skip having Postgres entirely and just have RisingWave / Feldera + columnar data on object storage.
datadrivenangel•5mo ago
Postgres + DuckDB is going to kill a lot of the need for streaming CDC. Overall the market will get bigger so there's no need for concern.
hodgesrm•5mo ago
I don't really understand what problem replication of mutable transaction data to Iceberg, or any data lake for that matter, solves for most PostgreSQL users.

Iceberg is optimized for fact data in very large tables with relatively rare changes and likewise rare changes to schema. It does that well and will continue to do so for the foreseeable future.

PostgreSQL databases typically don't generate huge amounts of data; that data can also be highly mutable in many cases. Not only that, the schema can change substantially. Both types of changes are hard to manage in replication, especially if the target is a system, like Iceberg, that does not handle change very well in the first place.

So that leaves the case where you have a lot of data in PostgreSQL that's creating bad economics. In that case, why not just skip PostgreSQL and put it in an analytic database to begin with?

p.s., I'm pretty familiar with trading systems that do archive transaction data to data lakes using Parquet for long-term analytics and compliance. That is a different problem. The data is for all intents and purposes immutable.

Edit: clarity

gunnarmorling•5mo ago
> PostgreSQL databases typically don't generate huge amounts of data

The live data set may not be huge, but the entire trail of all changes of all current and all previously existing data may easily exceed the volume of data you can reasonably process with Postgres.

In addition, its row-based storage format doesn't make it an ideal fit for typical analytical queries on large amounts of data.

Replicating the data from Postgres to Iceberg addresses these issues. But, of course, it's not without its own challenges, as demonstrated by the article.

hodgesrm•5mo ago
Right, that's the archive case. One way to handle that is to replicate the log records and let users figure it out in the analytic storage. That also allows time travel. Either way the data should be frozen or it makes the problem way harder.
hbarka•5mo ago
No mention of idempotency.