So I built this as a local alternative. Takes text, generates MP4s with AI narration and images. Uses Nano Banana Pro for images, ElevenLabs for voice, ffmpeg for assembly.
Currently supports 25 visual styles (watercolor, anime, retro-style, etc.) and 16 languages.
It's rough but works for my use case. Sharing in case others want something similar or want to help add more styles and improve it.
I’m hoping it will improve over time, and I think the next step should be making it fully open by switching to open alternatives for image and voice generation.
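
For anyone wondering what the ffmpeg assembly step could look like, here is a rough sketch in Python. It is an illustration only, not the project's actual code. It assumes each scene already has one still image and one narration audio file on disk (produced by whatever image and voice services are used). All file names and function names here are made up for the example. Each scene becomes a short MP4 that shows the image for the length of its narration, and the scenes are then joined with ffmpeg's concat demuxer.

```python
from pathlib import Path


def segment_cmd(image, audio, out):
    """Build an ffmpeg command that turns one still image plus one
    narration track into an MP4 segment (hypothetical helper)."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(image),   # repeat the still image as video
        "-i", str(audio),                 # narration audio
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",            # widely compatible pixel format
        "-c:a", "aac",
        "-shortest",                      # stop when the audio ends
        str(out),
    ]


def concat_list(segments):
    """Return the text of a concat-demuxer list file, one segment per line."""
    return "".join(f"file '{Path(s).as_posix()}'\n" for s in segments)


def concat_cmd(list_file, out):
    """Build an ffmpeg command that joins the segments without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_file),
        "-c", "copy",
        str(out),
    ]
```

To actually run it, pass each command list to `subprocess.run(cmd, check=True)` and write the output of `concat_list` to a text file before calling the concat command. Joining with `-c copy` only works cleanly when every segment was encoded with the same settings, which is why the segment command pins the codecs and pixel format.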