Current constraints (honest version)
No speaker diarization yet → one TTS voice for the entire video
Voice choice: female / male
Bilingual subtitles (e.g., EN+JP, EN+ZH, EN+KR) with length-aware line breaking
You can supply custom terms and their preferred translations to both the speech recognition (ASR) model and the translation model to improve accuracy
Long videos: we re-anchor timestamps periodically to fight drift
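To make "length-aware line breaking" concrete, here is a rough Python sketch (illustrative only, not the exact production logic). It assumes a column budget per line (max_width, a hypothetical parameter), counts wide East Asian characters as two columns, and breaks between words for space-delimited text or between characters for text without spaces. The function names are made up for this example.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate rendered width: wide/fullwidth characters count as 2 columns."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def break_subtitle(text: str, max_width: int = 42) -> list[str]:
    """Greedily wrap text into lines no wider than max_width columns.

    Space-delimited text breaks between words; text without spaces
    (typical for Japanese or Chinese) breaks between characters.
    A single word wider than max_width is left on its own line unbroken.
    """
    has_spaces = " " in text
    tokens = text.split() if has_spaces else list(text)
    joiner = " " if has_spaces else ""

    lines: list[str] = []
    current = ""
    for token in tokens:
        candidate = token if not current else current + joiner + token
        if current and display_width(candidate) > max_width:
            # Adding this token would overflow, so close the current line.
            lines.append(current)
            current = token
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

Usage: break_subtitle("the quick brown fox jumps over the lazy dog", 20) yields three lines, and a Japanese string is split by display width rather than character count. A real implementation would also avoid breaking before punctuation and would balance line lengths; this sketch does neither.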
Pipeline (condensed)
ASR with word-level timestamps
Segment cleanup + merge tiny fragments
MT twice → produce A/B subtitle tracks; constrain length to reduce overflow; use multiple language models to cross-check translation quality
TTS → single voice (female/male) for the full track
Mixback → keep ambience, duck the original audio; output SRT (mono or dual-lang) + dubbed MP4
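As an illustration of the "merge tiny fragments" step, here is a minimal Python sketch. It is an assumption about how such a merge could work, not the pipeline's actual code: the Segment type, the min_duration and max_gap thresholds, and the rule of folding a short segment into the previous one are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_tiny_fragments(
    segments: list[Segment],
    min_duration: float = 0.8,  # assumed threshold, tune per language
    max_gap: float = 0.3,       # assumed maximum silence to merge across
) -> list[Segment]:
    """Fold segments shorter than min_duration into the previous segment
    when the silence between the two is at most max_gap seconds."""
    merged: list[Segment] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        is_tiny = (seg.end - seg.start) < min_duration
        if prev is not None and is_tiny and (seg.start - prev.end) <= max_gap:
            # Extend the previous segment to cover the fragment.
            # Joining with a space assumes a space-delimited language.
            merged[-1] = Segment(prev.start, seg.end, f"{prev.text} {seg.text}")
        else:
            merged.append(seg)
    return merged
```

Note the limitation: a short fragment that follows a long pause (or opens the video) is kept as-is, because this sketch only merges backward. Merging forward into the next segment is a natural extension.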
Why post this now
It’s not “magic studio” quality, but it’s dependable for many real-world cases: course videos, onboarding, webinars. We found being explicit about limits (no diarization) actually speeds teams up.
What we’d love feedback on
The current user experience
Acceptable subtitle overflow rate on 30–60 min content
TTS pacing rules that feel most natural for multilingual reading speed
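So that "overflow rate" means the same thing to everyone answering, here is one possible definition as a small Python sketch. The budget (max_chars per line, max_lines per cue) is an assumption for illustration, and the function name is hypothetical; a cue counts as overflowing if it has too many lines or any line is too long.

```python
def overflow_rate(cues: list[str], max_chars: int = 42, max_lines: int = 2) -> float:
    """Fraction of cues whose text exceeds the line-count or line-length budget.

    Each cue is the subtitle text with lines separated by "\n".
    Length is measured in characters, so wide scripts (Japanese, Chinese,
    Korean) would normally need a smaller max_chars.
    """
    if not cues:
        return 0.0

    def overflows(text: str) -> bool:
        lines = text.split("\n")
        return len(lines) > max_lines or any(len(line) > max_chars for line in lines)

    return sum(overflows(cue) for cue in cues) / len(cues)
```

For example, if 2 of 4 cues break the budget, the function returns 0.5. If you measure overflow differently (reading speed in characters per second, for instance), please say so when sharing numbers.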
(If there’s interest, we’ll extract a minimal CLI with the exact steps and corner cases called out.)