Unlike existing serving stacks such as vLLM, which are primarily optimized for text-only workloads, ElasticMM introduces Elastic Multimodal Parallelism (EMP), a new execution paradigm that adapts parallelism across different inference stages and modalities.
Key findings from the paper:
Up to 4.2× reduction in TTFT (time to first token)
3.2×–4.5× higher throughput under mixed multimodal workloads
Core techniques: modality-aware scheduling, elastic stage partitioning, unified prefix caching, and non-blocking encoding
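For intuition, here is a minimal toy sketch of what modality-aware scheduling could look like: keep a separate queue per modality so text-only requests are not stuck behind heavy image/video encoding, and batch within a modality. The class name, queue policy, and batching parameter below are my own illustration, not the paper's actual implementation.

```python
from collections import defaultdict, deque

class ModalityAwareScheduler:
    """Toy per-modality request queues (hypothetical, not ElasticMM's code)."""

    def __init__(self):
        # modality name -> FIFO queue of request ids
        self.queues = defaultdict(deque)

    def submit(self, request_id, modality):
        self.queues[modality].append(request_id)

    def next_batch(self, modality, max_batch=8):
        # Pop up to max_batch requests of one modality so they can be
        # batched together without blocking other modalities.
        q = self.queues[modality]
        batch = []
        while q and len(batch) < max_batch:
            batch.append(q.popleft())
        return batch

sched = ModalityAwareScheduler()
sched.submit("r1", "text")
sched.submit("r2", "image")
sched.submit("r3", "text")
print(sched.next_batch("text"))   # text requests batched together: ['r1', 'r3']
print(sched.next_batch("image"))  # image request handled independently: ['r2']
```

The real system presumably makes far richer decisions (elastic stage partitioning, GPU reassignment), but the separation of queues captures the basic idea of not letting one modality's latency dominate another's.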
Paper (OpenReview PDF): https://openreview.net/pdf?id=Zd6VyjmN1S
GitHub repo: https://github.com/hpdps-group/ElasticMM
Curious to hear what the HN community thinks, especially those building LLM/MLLM inference stacks or dealing with multimodal serving in production.