We've been thinking a lot about inference infrastructure recently, and the challenges seem very different from those of training.
Training tends to be compute-heavy but predictable, while inference introduces concerns like:
- latency constraints
- dynamic batching
- unpredictable traffic patterns
- model versioning in production
- low GPU utilization at small batch sizes
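To make the dynamic batching point concrete: the usual approach is to collect incoming requests until either the batch is full or a latency budget expires, which directly trades the latency constraint against GPU utilization. Here's a minimal sketch, assuming a queue-based server (the function name, batch size, and wait budget are illustrative, not from any particular framework):

```python
import queue
import time

def collect_batch(request_queue, max_batch_size=8, max_wait_s=0.01):
    """Collect requests until the batch is full or the wait budget expires.

    Under light traffic the returned batch may be much smaller than
    max_batch_size -- which is exactly the small-batch GPU-utilization
    problem mentioned above.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget spent; ship what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the budget
    return batch
```

The `max_wait_s` knob is where the tension lives: raise it and throughput improves but tail latency grows; lower it and you serve tiny batches that leave the GPU idle.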
For people running inference in production today, what's been the most painful part?