TL;DR: Diffusion-based TTS models sound amazing but break down for real-time streaming because they require full-sequence attention. StreamFlow introduces a block-wise guided attention scheme that lets diffusion transformers generate speech chunk-by-chunk with near–SOTA quality and predictable low latency.
Why this matters:
Current diffusion speech models need to see the entire audio sequence, making them too slow and memory-heavy for assistants, agents, or anything that needs instant voice responses. Causal masks sound robotic; chunking adds weird seams. Streaming TTS has been stuck with a quality–latency tradeoff.
The idea:
StreamFlow restricts attention using sliding windows over blocks:
Each block can see W_b past blocks and W_f future blocks
Compute drops from full O(N²) attention to roughly O(N × W), where W is the fixed number of visible positions per frame, so cost grows linearly with sequence length (see the mask sketch after this list)
Prosody stays smooth, latency stays constant, and boundaries disappear with small overlaps + cross-fades
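To make the window concrete, here is a minimal mask-building sketch (my own illustration, not StreamFlow's code; block_len, w_b, and w_f are assumed names for the block length in tokens and the past/future window sizes in blocks):

```python
# Minimal sketch: boolean attention mask where tokens in block i may attend
# only to blocks i - w_b .. i + w_f. True = attention allowed.
import torch

def block_sliding_window_mask(n_tokens: int, block_len: int,
                              w_b: int, w_f: int) -> torch.Tensor:
    blocks = torch.arange(n_tokens) // block_len      # block index per position
    diff = blocks[:, None] - blocks[None, :]          # query block - key block
    # Keys up to w_b blocks in the past and w_f blocks in the future are visible.
    return (diff <= w_b) & (-diff <= w_f)

# 12 tokens in 3 blocks of 4; each block sees itself plus one block either side.
mask = block_sliding_window_mask(n_tokens=12, block_len=4, w_b=1, w_f=1)
# Can be passed as a boolean attn_mask to F.scaled_dot_product_attention.
```

Presumably this kind of mask is what replaces full attention during the block-wise fine-tuning and streaming inference described below.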
How it works:
The system is still a Diffusion Transformer, but trained in two phases:
Full-attention pretraining for global quality
Block-wise fine-tuning to adapt to streaming constraints
Generates mel-spectrograms; BigVGAN vocoder runs in parallel.
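A rough sketch of how that generate-then-vocode handoff could look (illustrative control flow only; generate_mel_block and vocode are dummy stand-ins, not StreamFlow's or BigVGAN's actual APIs):

```python
# Hedged sketch of the streaming pipeline: the model emits mel blocks one at a
# time while a vocoder worker turns finished blocks into waveform in parallel.
import queue
import threading
import numpy as np

NUM_BLOCKS, MEL_BINS, FRAMES_PER_BLOCK = 8, 80, 50   # illustrative sizes

def generate_mel_block(i: int) -> np.ndarray:
    """Stand-in for one block-wise diffusion pass (sees w_b past / w_f future blocks)."""
    return np.zeros((MEL_BINS, FRAMES_PER_BLOCK), dtype=np.float32)

def vocode(mel: np.ndarray) -> np.ndarray:
    """Stand-in for the vocoder step (BigVGAN in the paper); mel -> waveform chunk."""
    return np.zeros(mel.shape[1] * 256, dtype=np.float32)  # assumed hop size of 256

blocks: queue.Queue = queue.Queue()

def vocoder_worker():
    while True:
        mel = blocks.get()
        if mel is None:                   # sentinel: generation finished
            break
        audio = vocode(mel)
        print(f"emitted {audio.shape[0]} samples")   # first packet after ~1 block

t = threading.Thread(target=vocoder_worker)
t.start()
for i in range(NUM_BLOCKS):
    blocks.put(generate_mel_block(i))     # vocoder consumes while we keep generating
blocks.put(None)
t.join()
```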
Performance:
~180ms first-packet latency (80ms model, 60ms vocoder, 40ms overhead)
No latency growth with longer speech
MOS tests show near-indistinguishable quality vs non-streaming diffusion
Speaker similarity within ~2%, prosody continuity preserved
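Quick arithmetic check on that budget (the additive breakdown is my reading of the numbers above):

```python
# First-packet latency is just the cost of the first block, so it does not
# grow with utterance length: later blocks are computed while audio plays.
model_ms, vocoder_ms, overhead_ms = 80, 60, 40
print(model_ms + vocoder_ms + overhead_ms)  # 180 ms to first audio packet
```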
Key ablation takeaways:
Past context helps up to ~3 blocks; more adds little
Even a tiny future window greatly boosts naturalness
Best results: 0.4–0.6s block size, ~10–20% overlap
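A toy version of the overlap-and-cross-fade step on waveform chunks (my illustration, assuming linear fades and a 24 kHz sample rate; the paper may do the blending in mel space instead):

```python
# Adjacent blocks share a small overlap region that is blended with linear
# fades so block boundaries don't produce audible seams.
import numpy as np

def crossfade_blocks(blocks: list[np.ndarray], overlap: int) -> np.ndarray:
    """Concatenate 1-D blocks, blending `overlap` samples at each boundary."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = blocks[0]
    for nxt in blocks[1:]:
        blended = out[-overlap:] * fade_out + nxt[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], blended, nxt[overlap:]])
    return out

# 0.5 s blocks at 24 kHz with ~15% overlap => overlap of ~1800 samples
chunks = [np.random.randn(12000) for _ in range(4)]
audio = crossfade_blocks(chunks, overlap=1800)
print(audio.shape)  # (42600,) = 4 * 12000 - 3 * 1800
```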
Comparison:
Autoregressive TTS → streaming but meh quality
GAN TTS → fast but inconsistent
Causal diffusion → real-time but degraded
StreamFlow → streaming + near-SOTA quality
Bigger picture:
Smart attention shaping lets diffusion models work in real time without throwing away global quality. The same technique could apply to streaming music generation, translation, or interactive media.