Anam Cara-3: Why we think AI needs a face

20•grayne•1h ago

Hey HN, we're Ben and Caoimhe, cofounders of Anam. We built a service for interactive avatars and just shipped our latest model, cara-3. Try it at anam.ai, no sign-up required, or build with it at lab.anam.ai or anam.ai/cookbook.

Some context on why we're working on this: faces carry emotional signal that text and voice don't. Almost half the human brain is devoted to visual processing, and it's one of the first things we learn as babies. It's also a more accessible medium. Anam started, in part, from Ben watching his gran struggle with her iPad and thinking there should be a face she could just talk to.

cara-3 uses a two-stage pipeline: a diffusion transformer converts audio to motion embeddings (head position, eye gaze, lip shape, expression), then a rendering model applies those to a reference image to produce video frames. Separating motion from rendering means we can animate any face without retraining. The two models run in sequence within ~70ms time-to-first-frame on an H200, so we can run many concurrent avatar sessions on a single GPU.

The core of audio-to-motion is flow matching, but we found off-the-shelf formulations weren't stable enough for this task, so we developed a novel variant. We also built our own training data pipeline (and recently open-sourced the backbone: Metaxy) because existing frameworks made it hard to iterate without rerunning expensive steps.

We commissioned an independent blind evaluation comparing interactive avatars from Anam with HeyGen, Tavus and D-ID. Hundreds of participants played 20 Questions with the different offerings and cara-3 scored highest on every metric (p < 0.001), 24% above the closest competitor on average. What surprised us most: responsiveness correlated with overall experience (Spearman 0.697) far more than visual quality (0.473). In interactive settings, how fast you respond matters more than how good you look.

Ask us anything!

Comments

peanut_merchant•1h ago

One of the backend developers at Anam here, one of the hardest parts of developing this has been monitoring and analytics.

Most off the shelf solutions, or existing platforms heavily skew towards the normal http web service world. However, the bulk of our interactions happen over webrtc in long-running sessions, where the existing solutions for in-depth metrics and monitoring are much less mature and well documented.

Currently we're using influxdb, prometheus, grafana and some hand rolled monitoring code alongside the stats that webrtc offers itself. Would be interested to know how anyone out there is monitoring conversational flows, and webrtc traffic.

iogbole•58m ago

Really interesting architecture choice separating motion from rendering. That feels like the right abstraction boundary if you want identity generalisation without retraining.

The latency numbers are what stood out to me though. ~70ms time-to-first-frame is genuinely impressive for an interactive loop. In real conversations, responsiveness dominates perceived realism way more than visual fidelity, so that correlation result makes intuitive sense.

Curious how robust the audio-to-motion mapping is under messy real-world input (overlapping speech, accents, background noise, etc.). Does the flow-matching variant help mostly with stability during training, or also temporal consistency during inference?

grayne•52m ago

Full technical blog here: https://anam.ai/blog/cara-3-interactive-avatars

Show HN: Deathwink – Send messages to people after you die

Mac is now a gaming PC

Why Europe doesn't have a Tesla

Baking Custom Images for AI Agents

AI Agent swarm for Stock trading simulation

Show HN: Google rejected my privacy app for "low engagement"

Show HN: Mirroir – MCP server that gives AI agents a real iPhone to control

Molecular solar thermal energy storage in Dewar pyrimidone beyond 1.6 MJ/kg

Level of Detail

Dev implements HDMI FRL in AMDGPU, hence HDMI 2.1 on AMD Linux driver

Logic MSO – Oscilloscope with Python Support

Why AI writing is so generic, boring, and dangerous: Semantic ablation

Show HN: Wit-ts – A type-level WIT parser for TypeScript

Where Does Gold Come From?

Show HN: My 16MB vibe-coded voice cloning app

Intelligent AI Delegation

Show HN: Boolean-query-parser – From a 4-hour hack to 3k downloads

RCT: Vaporized cannabis versus placebo for acute migraine

Show HN: Local Voice Assistant

Sentinel – watch over your Tailscale network and notify of changes

Temporal Raises $300M Series D to Make Agentic AI Real for Companies

Show HN: MAKO – Open protocol for LLM-optimized web content (93% fewer tokens)

Show HN: Cai – AI actions on your clipboard, runs locally (macOS, open source)

Show HN: Kremis – Deterministic memory graph for AI agents (Rust)

Instagram boss defends app in trial over alleged harms to kids

Java.evolved: Java has evolved. Your code can too

Vibe coding broke the Ballmer Peak

Quiet: A private, P2P alternative to Slack and Discord built on Tor and IPFS

Many consumer electronics manufacturers will bankrupt due to AI memory crisis

EU launches probe into xAI over sexualized images