Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM
If you're someone who prefers text, here's the quick TL;DR:
Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, where any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned
What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes, exposed by a single flag in the job config (see the sketch after this list)
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later
-> Resumable, chunked full loads: a pod crash resumes from the last completed chunk instead of restarting the whole load
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file.
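To make the "single flag" point concrete, here's a rough sketch of what such a job config could look like. To be clear, the field names below are illustrative guesses, not OLake's actual config schema; check the repo docs for the real format.

```yaml
# Illustrative only: field names are hypothetical, not OLake's real schema.
source:
  type: mongodb
  hosts: ["mongo-0.example.internal:27017"]   # made-up host
  database: orders_db
  collection: orders

sync_mode: cdc          # hypothetical single flag: "full_load" for the initial dump, "cdc" for ongoing changes
full_load:
  chunked: true         # resumable chunks, so a crashed pod picks up where it left off
  chunk_size: 100000

destination:
  type: iceberg
  warehouse: s3://my-lake/warehouse   # made-up bucket
  table: raw.orders
```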
Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.
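For the downstream half of that line, here's a minimal PySpark sketch of querying the landed Iceberg table. The catalog name, warehouse path, table, and column names are placeholders I made up, and it assumes the iceberg-spark-runtime jar is on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

# Names and paths below are placeholders; a Glue or REST catalog works here too.
spark = (
    SparkSession.builder
    .appName("query-olake-iceberg")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-lake/warehouse")
    .getOrCreate()
)

# Read the table the CDC writer landed; newly added MongoDB fields show up
# as nullable columns, so downstream jobs keep working without code changes.
orders = spark.table("lake.raw.orders")
orders.groupBy("status").count().show()

# Complex sub-documents arrive as JSON strings; parse them on read.
# The column name and schema here are guesses for illustration.
addr_schema = T.StructType([
    T.StructField("city", T.StringType()),
    T.StructField("zip", T.StringType()),
])
parsed = orders.withColumn("address", F.from_json("address_json", addr_schema))
```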
Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.
(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)
Check out the GitHub repo: https://github.com/datazip-inc/olake