Under the hood, the system combines LLMs for script and lyric writing, custom on-device multimodal embedding models that retrieve candidate clips, and compositional models that select and arrange those clips to illustrate the narrative. Selection has to balance local and global factors: semantic alignment with each narrative beat, diversity across the final cut, and the temporal structure of the story.
Embedding and retrieval run entirely on-device, so your footage never needs to be uploaded. That reduces friction, preserves privacy, and avoids GPU-heavy server-side processing.
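To give a flavor of the relevance/diversity trade-off in clip selection: a common way to frame it is greedy maximal-marginal-relevance (MMR) scoring over clip embeddings. The sketch below is illustrative only, not our actual implementation; the function name, weights, and toy embeddings are all made up for the example.

```python
import numpy as np

def select_clips(query_emb, clip_embs, k=5, diversity=0.3):
    """Greedy MMR-style selection: trade off relevance to the current
    narrative beat (query) against redundancy with clips already chosen.
    Hypothetical sketch, not the production algorithm."""
    # Normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    C = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    relevance = C @ q
    selected, candidates = [], list(range(len(C)))
    while candidates and len(selected) < k:
        if selected:
            # Redundancy = max similarity to any already-selected clip.
            redundancy = np.max(C[candidates] @ C[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `diversity=0`, this degenerates to plain nearest-neighbor retrieval; raising it pushes the cut toward varied footage even when a near-duplicate clip scores slightly higher on relevance alone.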
Demo of making a voiceover video: https://www.youtube.com/shorts/KFobJ_3az9E
Demo of making a music video: https://www.youtube.com/shorts/uakZG3rcP_M
We’ve been using it across a range of professional and personal use cases, from realtor home tours and product demos to family videos and storytelling: https://www.heyaio.com/examples
If you’re interested, you can try Hey Aio here (iOS, free): https://apps.apple.com/us/app/hey-aio/id6739266610
We’d love to hear what you think!