BindWeave (https://www.bindweave1.com) is our attempt to solve that. It’s a subject-consistent video generation framework that unifies single- and multi-subject prompts through a cross-modal MLLM-DiT architecture: a multimodal large language model coupled with a diffusion transformer. By combining entity grounding and representation alignment, the model interprets complex prompts and keeps visual identities stable over time.
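For readers wondering what an MLLM-DiT coupling looks like in practice, here is a minimal sketch of the general pattern: the multimodal LLM encodes the prompt and reference images into a sequence of conditioning tokens, and each diffusion-transformer block cross-attends to those tokens while denoising the video latents. This is a simplified illustration under our own assumptions, not BindWeave’s actual code; every module name, shape, and dimension below is made up.

```python
# Minimal sketch of one way an MLLM-DiT coupling can be wired up (all
# module names, shapes, and dimensions are illustrative, not BindWeave's
# actual implementation): the multimodal LLM turns the prompt and
# reference images into conditioning tokens, and each DiT block
# cross-attends to them while denoising the video latents.
import torch
import torch.nn as nn


class CrossAttnDiTBlock(nn.Module):
    """One diffusion-transformer block: self-attention over noisy video
    latent tokens, then cross-attention over MLLM prompt/subject tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latents: (B, N_latent, dim)  noisy video patch tokens
        # cond:    (B, N_cond, dim)    MLLM hidden states for the text prompt
        #                              and the reference-image (subject) tokens
        x = latents
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


# Toy shapes: one clip with 1024 latent tokens, conditioned on 77 prompt
# tokens plus 2 x 256 tokens for two reference images (invented numbers).
latents = torch.randn(1, 1024, 512)
mllm_tokens = torch.randn(1, 77 + 2 * 256, 512)
out = CrossAttnDiTBlock()(latents, mllm_tokens)
print(out.shape)  # torch.Size([1, 1024, 512])
```

The practical payoff of routing identity through conditioning tokens rather than per-character fine-tuning is that one set of weights can serve any subject you supply a reference image for.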
We built it because we wanted reliable, controllable subjects for storytelling, digital avatars, and research demos—without retraining for each character. Now creators can describe a scene, attach one or more reference images, and generate stable, high-fidelity clips where everyone stays recognizable throughout.
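To make that workflow concrete, here is roughly what it looks like from a creator’s side. The `bindweave` package, the `generate_video` function, and all of its parameters are hypothetical placeholders for illustration, not the project’s real API.

```python
# Hypothetical usage sketch (placeholder names, not BindWeave's real API):
# describe the scene, attach one reference image per subject, render a clip.
from bindweave import generate_video  # hypothetical package and function

clip = generate_video(
    prompt="Two friends share coffee on a rainy balcony, then wave goodbye.",
    reference_images=["alice.png", "bob.png"],  # one image per subject
    num_frames=97,
    seed=42,
)
clip.save("balcony_scene.mp4")
```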
Demo videos and a short paper summary are on the site. We’d love feedback from anyone working on AI video, cross-modal generation, or identity preservation—what use cases or limitations matter most to you?