We process video, images, and documents through 20+ ML models simultaneously at Mixpeek. A single 10-minute video triggers transcription, visual embeddings, scene descriptions, face detection, object detection, brand safety classification, and more — all in parallel with different compute requirements.
We wrote up the full Ray architecture we use in production on KubeRay/GKE. Not a tutorial — more of a "here's what we actually run and what bit us."
Some highlights:
- *Custom resource isolation* — We use a synthetic `{"batch": 1}` resource to prevent batch pipeline tasks from starving Ray Serve inference replicas. Same cluster, zero interference, no runtime overhead.
- *Flexible actor pools* — Fixed-size `ActorPoolStrategy(size=8)` deadlocks when concurrent jobs compete for workers. `min_size=1, max_size=N` guarantees every job can make progress.
- *Shared preprocessing* — Naive approach runs S3 download + format normalization once per extractor. With 10 extractors on 1,000 files, that's 10,000 redundant reads. We preprocess once and fan out via Ray Dataset.
- *Distributed Qdrant writes* — Ray Data's `Datasink` API distributes vector DB writes across all workers with backpressure, instead of collecting everything on one node.
- *Fire-and-forget progress tracking* — A Ray actor as a shared counter lets workers report progress without blocking the pipeline.
- *Zero-CPU head node* — Learned this one the hard way when a runaway batch job took down our scheduler.
The post includes the KubeRay YAML, Ray Serve autoscaling configs, pipeline code, and the LocalStack parquet workaround that saved us hours of debugging silent hangs.
https://mixpeek.com/blog/ray-distributed-ml-pipeline-archite...
Happy to answer questions about any of the patterns or trade-offs.