Today we're shipping our new model, persona-1.5-live, the first to achieve both photorealism and emotion at conversational latency. We see this as a significant step toward passing the "Turing test" for video agents.
Here's an unedited demo video of a conversation between Cosmo and one of us: https://www.loom.com/share/406534ea9991458cb64030df02e565de
You can also FaceTime Cosmo yourself here: https://demo.keyframelabs.com. Try asking him for a sad Shakespearean monologue. Or for a "funny" dad joke.
Voice has emerged as a primary conversational interface across industries, and we think video is the next leap. In our early pilots, face-to-face interaction drives measurably better outcomes in things like sales training and language learning.
Our constraints from day one have been:
- Make meaningful progress toward beating the uncanny valley
- Run in real time, with world-scale efficiency
In training persona-1.5-live, we didn't have access to giant clusters or hyperscaler budgets. This forced quite a bit of innovation in how we approached diffusion:
- An aggressively lightweight architecture
- Training tricks to squeeze signal out of limited data
Perhaps the most surprising finding was that, for our problem space, representation quality can be a viable substitute for scale. We spent an inordinate amount of time crafting a from-scratch latent space for persona-1.5-live to keep identity and emotion stable given our compute and data constraints.
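To make the idea concrete, here's a toy PyTorch sketch of the general shape, not our production architecture: split each frame's latent into a slow-moving identity code and a fast-moving emotion code, and penalize frame-to-frame identity drift. All module names and dimensions are illustrative.

```python
# Illustrative sketch only; not the persona-1.5-live architecture.
# Shows one generic way to split a frame latent into identity and
# emotion subspaces and penalize identity drift across frames.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, image_dim=3 * 64 * 64, id_dim=128, emo_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(image_dim, 512),
            nn.ReLU(),
        )
        self.id_head = nn.Linear(512, id_dim)    # slow: who the person is
        self.emo_head = nn.Linear(512, emo_dim)  # fast: what they're expressing

    def forward(self, frames):
        h = self.backbone(frames)
        return self.id_head(h), self.emo_head(h)

def identity_consistency_loss(id_codes):
    # Penalize frame-to-frame drift in the identity code so the subject
    # stays the same person while emotion is free to vary.
    return (id_codes[1:] - id_codes[:-1]).pow(2).mean()

encoder = FrameEncoder()
clip = torch.randn(16, 3, 64, 64)  # 16 consecutive frames (dummy data)
id_codes, emo_codes = encoder(clip)
loss = identity_consistency_loss(id_codes)
loss.backward()
```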
The result: photoreal AI humans with emotion and real-time latency, priced at just $0.06 per minute.
If you're interested in building with our API, see the docs here: https://docs.keyframelabs.com. It's free to get started.
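To give a feel for the flow, here's a rough Python sketch of starting a session. The endpoint, field names, and auth scheme below are illustrative placeholders, not the real interface; the docs linked above have the actual API.

```python
# Hypothetical usage sketch: endpoint, fields, and auth are assumed
# for illustration. See https://docs.keyframelabs.com for the real API.
import requests

resp = requests.post(
    "https://api.keyframelabs.com/v1/sessions",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "persona-1.5-live",
        "persona": "cosmo",  # assumed persona identifier
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # assumed: session details for the real-time video call
```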
Excited to see what y'all think!