We process video, images, and documents through 20+ ML models simultaneously at Mixpeek. A single 10-minute video triggers transcription, visual embeddings, scene descriptions, face detection, object detection, brand safety classification, and more — all in parallel with different compute requirements.
We wrote up the full Ray architecture we use in production on KubeRay/GKE. Not a tutorial — more of a "here's what we actually run and what bit us."
Some highlights:
- *Custom resource isolation* — We use a synthetic `{"batch": 1}` resource to prevent batch pipeline tasks from starving Ray Serve inference replicas. Same cluster, zero interference, no runtime overhead.
- *Flexible actor pools* — Fixed-size `ActorPoolStrategy(size=8)` deadlocks when concurrent jobs compete for workers. `min_size=1, max_size=N` guarantees every job can make progress.
- *Shared preprocessing* — Naive approach runs S3 download + format normalization once per extractor. With 10 extractors on 1,000 files, that's 10,000 redundant reads. We preprocess once and fan out via Ray Dataset.
- *Distributed Qdrant writes* — Ray Data's `Datasink` API distributes vector DB writes across all workers with backpressure, instead of collecting everything on one node.
- *Fire-and-forget progress tracking* — A Ray actor as a shared counter lets workers report progress without blocking the pipeline.
- *Zero-CPU head node* — Learned this one the hard way when a runaway batch job took down our scheduler.
The post includes the KubeRay YAML, Ray Serve autoscaling configs, pipeline code, and the LocalStack parquet workaround that saved us hours of debugging silent hangs.
https://mixpeek.com/blog/ray-distributed-ml-pipeline-archite...
Happy to answer questions about any of the patterns or trade-offs.