169M parameters
Streaming support
Zero-shot voice cloning
0.25 RTF on CPU, meaning it generates 30 seconds of audio in 7.5 seconds
Requires 3-12 seconds of reference audio for voice cloning
Apache 2.0 license
The model was trained on a single L40S GPU. It’s not SOTA in most cases, can be a bit unstable, and sometimes fails to capture voice likeness.
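For anyone unfamiliar with the RTF figure above: real-time factor is processing time divided by the duration of the audio produced, so anything below 1.0 is faster than real time. Here is a minimal sketch of that arithmetic. The function names are hypothetical and are not part of the model's code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of audio at a given RTF."""
    return audio_seconds * rtf


# The figures quoted in the post: 30 s of audio at 0.25 RTF.
print(synthesis_time(30.0, 0.25))   # 7.5
print(real_time_factor(7.5, 30.0))  # 0.25
```

The same two functions work for checking RTF on your own CPU: time one generation call and divide by the length of the output audio.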
marques576•20h ago