Some of the concrete questions I have in mind:
1. How do large LLM providers handle the flow of training data, evaluation results, and human feedback? Are these managed through event streams (like Kafka) for real-time processing, or do they rely more on batch processing and traditional ETL pipelines? (A sketch of the streaming variant is below.)
2. For complex ML pipelines with dependencies (e.g. data ingestion -> preprocessing -> training -> evaluation -> deployment), do they use event-driven orchestration, where each stage publishes completion events, or traditional workflow orchestrators like Airflow with polling-based dependency management? (Also sketched below.)
3. How do they handle real-time performance monitoring and safety signals? Are these event-driven systems that can trigger immediate responses (like model rollbacks), or primarily batch analytics with delayed reactions? (Third sketch below.)
I'm basically trying to understand how far the event-driven paradigm extends into modern AI infra, and I would love any high-level insights from someone who is (or has been) working with it.
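To make question 1 concrete, here's a minimal sketch of what the streaming variant might look like: human-feedback events produced to a Kafka topic as they arrive, instead of landing in a warehouse for a nightly ETL run. The topic name, event schema, and the use of the confluent_kafka client are all my assumptions, not anything a provider has confirmed.

```python
# Hypothetical sketch: streaming human feedback as events (question 1).
# Topic name and event schema are assumptions, not a known provider setup.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_feedback(conversation_id: str, rating: int, comment: str) -> None:
    """Emit one human-feedback event; downstream consumers (reward-model
    training, eval dashboards) can react in near real time."""
    event = {
        "conversation_id": conversation_id,
        "rating": rating,  # e.g. thumbs up/down mapped to +1/-1
        "comment": comment,
        "ts": time.time(),
    }
    producer.produce(
        "human-feedback",  # hypothetical topic
        key=conversation_id.encode(),
        value=json.dumps(event).encode(),
    )

publish_feedback("conv-123", 1, "helpful answer")
producer.flush()  # ensure delivery before exiting
```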
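For question 2, the event-driven alternative to an Airflow DAG would look roughly like this: each stage is a consumer that reacts to its predecessor's completion event and publishes its own, with no scheduler polling for state. Again purely illustrative; the topics, payloads, and run_training placeholder are made up.

```python
# Hypothetical sketch: choreography-style pipeline stage (question 2).
# Each stage consumes its predecessor's completion event and publishes its own.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "training-stage",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["preprocessing.completed"])  # hypothetical topic

def run_training(dataset_uri: str) -> str:
    """Placeholder for the actual training job (e.g. a k8s job submission)."""
    return f"s3://models/{dataset_uri.rsplit('/', 1)[-1]}"

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    model_uri = run_training(event["dataset_uri"])
    # Publish our own completion event so evaluation can pick it up;
    # the dependency graph lives implicitly in the topics.
    producer.produce(
        "training.completed",
        value=json.dumps({"model_uri": model_uri}).encode(),
    )
    producer.flush()
```

The obvious trade-off vs. Airflow is that the DAG becomes implicit in the topic wiring, which is one reason teams often keep a workflow orchestrator on top even when the transport is event-driven.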
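And for question 3, a streaming safety monitor would be a consumer over a per-request metrics topic that triggers a rollback the moment a sliding-window threshold is crossed, rather than a batch job noticing hours later. The threshold, topic, and rollback hook are all invented for illustration.

```python
# Hypothetical sketch: event-driven safety monitor (question 3).
# Consumes per-request safety metrics and rolls back immediately on a
# sustained breach; all names and thresholds are illustrative.
import json
from collections import deque

from confluent_kafka import Consumer

WINDOW = 100               # sliding window of recent requests
MAX_VIOLATION_RATE = 0.05  # roll back above 5% violations in the window

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "safety-monitor",
    "auto.offset.reset": "latest",  # only care about live traffic
})
consumer.subscribe(["inference.safety-metrics"])  # hypothetical topic

def rollback(model_id: str) -> None:
    """Placeholder: point the serving layer back at the previous model."""
    print(f"ROLLBACK triggered for {model_id}")

recent = deque(maxlen=WINDOW)
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    recent.append(1 if event["violation"] else 0)
    if len(recent) == WINDOW and sum(recent) / WINDOW > MAX_VIOLATION_RATE:
        rollback(event["model_id"])
        recent.clear()  # avoid re-firing on the same window
```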
enether•6mo ago
I watched the talk live. They didn't exactly share what it's used for, but it was clear that it was ubiquitous and that they invested heavily in a) availability, because it was critical for them, and b) simplicity, because they wanted to onboard as many teams onto it as possible.
1 - https://x.com/BdKozlovski/status/1924838299706790307
2 - https://current.confluent.io/archive/2025/london (but it's contact-walled)