Some context on why we're working on this: faces carry emotional signal that text and voice don't. Almost half the human brain is devoted to visual processing, and it's one of the first things we learn as babies. It's also a more accessible medium. Anam started, in part, from Ben watching his gran struggle with her iPad and thinking there should be a face she could just talk to.
cara-3 uses a two-stage pipeline: a diffusion transformer converts audio to motion embeddings (head position, eye gaze, lip shape, expression), then a rendering model applies those to a reference image to produce video frames. Separating motion from rendering means we can animate any face without retraining. The two models run in sequence with a time-to-first-frame of ~70ms on an H200, which lets us run many concurrent avatar sessions on a single GPU.
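For intuition, here's a minimal sketch of that separation; the class names, embedding size, and interfaces are hypothetical stand-ins, not our actual models or API:

    # Hypothetical sketch of the two-stage split (Python/NumPy).
    import numpy as np

    class AudioToMotion:
        # Stage 1: diffusion transformer, audio chunk -> motion embedding
        # (head pose, eye gaze, lip shape, expression).
        def __call__(self, audio_chunk: np.ndarray) -> np.ndarray:
            return np.zeros(128, dtype=np.float32)  # stub for the DiT sampler

    class Renderer:
        # Stage 2: applies a motion embedding to a fixed reference image,
        # which is why stage 1 never needs retraining for a new face.
        def __init__(self, reference_image: np.ndarray):
            self.reference = reference_image

        def __call__(self, motion: np.ndarray) -> np.ndarray:
            return self.reference  # stub for the actual rendering model

    def animate(audio_chunks, reference_image):
        a2m, render = AudioToMotion(), Renderer(reference_image)
        for chunk in audio_chunks:       # the two models run in sequence
            yield render(a2m(chunk))     # per chunk of streamed audio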
The core of audio-to-motion is flow matching, but we found off-the-shelf formulations weren't stable enough for this task, so we developed a novel variant. We also built our own training data pipeline (and recently open-sourced the backbone: Metaxy) because existing frameworks made it hard to iterate without rerunning expensive steps.
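We won't detail the variant here, but for readers unfamiliar with flow matching, this is the standard linear-interpolant conditional objective that off-the-shelf formulations build on (PyTorch; shapes are illustrative, x1 is a batch of motion targets):

    # Standard conditional flow matching loss -- not our variant.
    import torch

    def cfm_loss(model, x1, cond):
        # model(x_t, t, cond) predicts the velocity field v_t;
        # x1: (batch, dim) data samples, cond: audio conditioning.
        x0 = torch.randn_like(x1)                       # noise endpoint
        t = torch.rand(x1.shape[0], 1, device=x1.device)
        xt = (1 - t) * x0 + t * x1                      # straight-line path
        target = x1 - x0                                # constant velocity
        return ((model(xt, t, cond) - target) ** 2).mean()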
We commissioned an independent blind evaluation comparing interactive avatars from Anam, HeyGen, Tavus, and D-ID. Hundreds of participants played 20 Questions with each offering, and cara-3 scored highest on every metric (p < 0.001), on average 24% above the closest competitor. What surprised us most: responsiveness correlated with overall experience (Spearman 0.697) far more strongly than visual quality did (0.473). In interactive settings, how fast you respond matters more than how good you look.
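For anyone who wants to reproduce this kind of analysis, a rank correlation like the ones above is computed as follows (scipy; the scores are illustrative stand-ins, not the study data):

    import numpy as np
    from scipy.stats import spearmanr

    responsiveness = np.array([5, 4, 4, 2, 5, 3])  # made-up per-session ratings
    overall        = np.array([5, 4, 3, 2, 5, 2])
    rho, p = spearmanr(responsiveness, overall)
    print(f"rho={rho:.3f}, p={p:.3g}")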
Ask us anything!
peanut_merchant•1h ago
Most off-the-shelf solutions and existing platforms skew heavily towards the conventional HTTP web-service world. The bulk of our interactions, however, happen over WebRTC in long-running sessions, where the existing options for in-depth metrics and monitoring are much less mature and less well documented.
Currently we're using InfluxDB, Prometheus, Grafana, and some hand-rolled monitoring code (rough sketch below) alongside the stats WebRTC offers itself. Would be interested to hear how anyone out there is monitoring conversational flows and WebRTC traffic.
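For concreteness, the hand-rolled piece looks roughly like this; it assumes an aiortc-style RTCPeerConnection (getStats() returning a report of W3C-shaped stats objects) and prometheus_client, and the metric names and poll interval are just our choices:

    # Mirror a few WebRTC getStats() fields into Prometheus gauges.
    import asyncio
    from prometheus_client import Gauge, start_http_server

    RTT = Gauge("webrtc_rtt_seconds", "RTP round-trip time", ["session"])
    LOST = Gauge("webrtc_packets_lost", "Cumulative packets lost", ["session"])
    JITTER = Gauge("webrtc_jitter", "Inbound RTP jitter", ["session"])

    async def export_stats(pc, session_id, interval=5.0):
        # Poll the peer connection and update the gauges until it closes.
        while pc.connectionState not in ("closed", "failed"):
            report = await pc.getStats()
            for stats in report.values():
                if stats.type == "remote-inbound-rtp":
                    RTT.labels(session=session_id).set(stats.roundTripTime)
                elif stats.type == "inbound-rtp":
                    LOST.labels(session=session_id).set(stats.packetsLost)
                    JITTER.labels(session=session_id).set(stats.jitter)
            await asyncio.sleep(interval)

    start_http_server(9091)  # Prometheus scrapes localhost:9091/metrics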