More specifically, the questions I have in mind are:
1. How do large LLM providers handle the flow of training data, evaluation results, and human feedback? Is this flow managed through event streams (like Kafka) for real-time processing, or do they rely more on batch processing and traditional ETL pipelines?
2. For complex ML pipelines with dependencies (e.g. data ingestion -> preprocessing -> training -> evaluation -> deployment), do they use event-driven orchestration where each stage publishes completion events (roughly the pattern in the first sketch below), or do they stick with traditional workflow orchestrators like Airflow and polling-based dependency management?
3. How do they handle real-time performance monitoring and safety signals? Are these event-driven systems that can trigger immediate responses, like model rollbacks (second sketch below), or are they primarily batch analytics with delayed reactions?
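To make questions 1 and 2 concrete, this is roughly the hand-off pattern I have in mind: pipeline data and stage completions flowing as events rather than being polled by a scheduler. It's just a sketch using kafka-python against a local broker; the topic names, payload fields, and `launch_training()` helper are all made up, not anyone's real setup.

```python
# Sketch of the "completion event" pattern: the training stage reacts to a
# "preprocessing finished" event and, when done, publishes its own completion
# event for the next stage. Topic names and payload schema are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def launch_training(dataset_uri: str) -> str:
    """Hypothetical stand-in for kicking off a training job."""
    print(f"launching training on {dataset_uri}")
    return "run-123"

consumer = KafkaConsumer(
    "pipeline.preprocessing.complete",
    bootstrap_servers="localhost:9092",
    group_id="training-trigger",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    run_id = launch_training(msg.value["dataset_uri"])
    producer.send("pipeline.training.complete", {"run_id": run_id})
    producer.flush()
```

The contrast in my head is an Airflow DAG where a sensor periodically polls for upstream completion instead of reacting the moment the event lands.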
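And for question 3, this is the kind of "safety signal triggers an immediate response" loop I'm imagining. Again only a sketch: the topic name, metric schema, threshold, and `rollback_model()` helper are all invented for illustration.

```python
# Sketch of a real-time reaction loop: consume safety/quality metrics as
# events and roll back as soon as a threshold is breached.
import json
from kafka import KafkaConsumer

SAFETY_THRESHOLD = 0.05  # e.g. max tolerated rate of flagged outputs (made up)

def rollback_model(model_id: str) -> None:
    """Hypothetical stand-in for shifting traffic back to the last good version."""
    print(f"rolling back {model_id} to previous version")

consumer = KafkaConsumer(
    "serving.safety.metrics",
    bootstrap_servers="localhost:9092",
    group_id="safety-watchdog",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    if event["flagged_rate"] > SAFETY_THRESHOLD:
        rollback_model(event["model_id"])
```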
I'm basically trying to understand how far the event-driven paradigm reaches into modern AI infra, and I'd love any high-level insights from anyone who is (or has been) working on these systems.