We've built LongCat Video Avatar, an audio-driven AI system for long-form avatar video generation with stable identity, natural motion, and professional lip-sync.
Most avatar models work well for short clips but break down over time. LongCat Video Avatar is designed specifically for minutes-to-hours video generation without identity drift or quality collapse.
Key capabilities:
- Long-Form Stability: Generate videos from minutes to hours without quality degradation. Cross-chunk latent stitching reduces the visual noise common in chunk-based generation (see the first sketch after this list).
- Natural Human Dynamics: Disentangled motion modeling produces realistic gestures and idle motion, even during silent segments (see the second sketch after this list).
- Multi-Person Support: Native handling of multi-speaker conversations with accurate turn-taking and identity preservation.
- Production-Ready Output: Up to 720p/30fps with flexible aspect ratios (16:9, 9:16, 1:1). Designed for commercial deployment and SaaS integration.
- Unified Generation Modes: Supports AT2V (audio + text to video), ATI2V (audio + text + image to video), and audio-conditioned video continuation in one framework (see the third sketch after this list).
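
To make the stitching idea concrete: consecutive chunks are generated with a few overlapping latent frames, and the overlap is cross-faded before decoding so chunk boundaries don't show as seams. The NumPy sketch below is a toy illustration only; the function name, overlap length, and linear blend are simplifications, not our production pipeline.

```python
import numpy as np

def stitch_chunks(chunks, overlap=4):
    """Cross-fade overlapping latent frames between consecutive chunks.

    Each chunk is assumed to share its first `overlap` latent frames with
    the previous chunk's last `overlap` frames (an illustrative convention).
    """
    out = [chunks[0]]
    # Linear cross-fade weights ramping 0 -> 1 across the overlap region.
    w = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1, 1)
    for nxt in chunks[1:]:
        prev_tail = out[-1][-overlap:]
        blended = (1.0 - w) * prev_tail + w * nxt[:overlap]
        out[-1] = out[-1][:-overlap]  # drop the raw tail...
        out.append(np.concatenate([blended, nxt[overlap:]]))  # ...splice in the blend
    return np.concatenate(out)

# Example: three 16-frame latent chunks -> one 40-frame sequence.
rng = np.random.default_rng(0)
chunks = [rng.standard_normal((16, 8, 32, 32)) for _ in range(3)]
print(stitch_chunks(chunks).shape)  # (40, 8, 32, 32)
```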
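
A similarly toy sketch of the motion disentanglement: lip motion is locked to the audio, gestures are loosely audio-correlated, and an audio-independent idle component keeps the avatar alive during silence. The linear "heads" here stand in for learned networks and are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
W_lip, W_gesture = rng.standard_normal((2, 16, 8))  # toy stand-ins for learned heads

def motion_for_frame(audio_feat, is_silent):
    """Map one frame's audio features to disentangled motion components."""
    lip = audio_feat @ W_lip              # tightly audio-locked mouth motion
    gesture = audio_feat @ W_gesture      # loosely audio-correlated body motion
    idle = 0.05 * rng.standard_normal(8)  # audio-independent idle dynamics
    if is_silent:
        lip[:] = 0.0  # mouth at rest, but the body keeps moving
    return {"lip": lip, "body": gesture + idle}

frame = motion_for_frame(rng.standard_normal(16), is_silent=True)
print(frame["body"])  # nonzero even with no speech -> no frozen avatar
```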
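
And to unpack the three modes: they differ only in what each request conditions on. The class and field names below are a hypothetical interface for illustration, not our public API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AvatarRequest:
    # Hypothetical conditioning bundle; field names are illustrative only.
    audio_path: str                        # driving speech track (always required)
    prompt: Optional[str] = None           # text description of avatar and scene
    reference_image: Optional[str] = None  # identity image (ATI2V)
    previous_clip: Optional[str] = None    # tail video to extend (continuation)

def mode_of(req: AvatarRequest) -> str:
    """Infer the generation mode from which conditions are supplied."""
    if req.previous_clip is not None:
        return "continuation"  # extend an existing clip under new audio
    if req.reference_image is not None:
        return "ATI2V"         # audio + text + image -> video
    return "AT2V"              # audio + text -> video

print(mode_of(AvatarRequest("speech.wav", prompt="a news anchor")))      # AT2V
print(mode_of(AvatarRequest("speech.wav", reference_image="face.png")))  # ATI2V
```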
Why we built this:
Existing avatar tools struggle with long-form content: they accumulate errors from chunk to chunk, produce stiff motion, and depend on repeatedly re-injecting reference images to hold identity. We wanted a system purpose-built for podcasts, lectures, corporate presentations, and other extended-format use cases.
We'd love feedback on:
- Workflows where long-form avatar video is useful
- API features for content pipelines
- Technical approaches to improve motion naturalness
Try it here: https://www.longcatavatar.com/?i=d1d5k
Technical details and evaluation results are available on our site.