I'm Heshan, founder of X-Pilot. We're building an AI video generator for online courses and educational content. Unlike most text-to-video generators that render videos directly from models (and often produce random stock footage unrelated to the actual content), we take a code-first approach: generate editable code layers, let users verify and refine them, then render to video.
The Problem We're Solving
Most AI video generators treat "education" and "marketing" the same: they optimize for "looks good" rather than "logically accurate." When you feed a technical tutorial or course script into a generic video AI, you get:
- Random B-roll that doesn't match the concept being explained
- Incorrect visualizations (e.g., showing a "for loop" diagram when explaining recursion)
- No way to systematically fix errors without regenerating everything
For educators, corporate trainers, and knowledge creators, accuracy matters more than aesthetics. A single incorrect diagram can break a learner's mental model.
Our Approach: Code as the Intermediate Layer
Instead of a text → video black box, we do: Text/PDF/Doc → Structured Code (Remotion + Visual Box Engine) → Editable Preview → Final Render
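To make that intermediate layer concrete, here is a heavily simplified sketch of what a generated composition can look like. The component names, props, and the ./visual-boxes import path are illustrative stand-ins, not our actual API:

```tsx
// Illustrative only: real Visual Box components have a richer prop surface.
import React from "react";
import { AbsoluteFill, Sequence } from "remotion";
import { CauseEffectFlow, ComparisonTable } from "./visual-boxes"; // hypothetical import path

// One generated "scene" per concept in the source script.
export const LessonVideo: React.FC = () => (
  <AbsoluteFill style={{ backgroundColor: "white" }}>
    {/* Frames 0-299: explain why caching reduces latency */}
    <Sequence from={0} durationInFrames={300}>
      <CauseEffectFlow
        concept="Why caching reduces latency"
        steps={["Request arrives", "Cache hit", "Skip the database", "Faster response"]}
      />
    </Sequence>
    {/* Frames 300-599: compare eviction policies */}
    <Sequence from={300} durationInFrames={300}>
      <ComparisonTable
        concept="LRU vs. LFU eviction"
        columns={["LRU", "LFU"]}
        rows={[
          ["Evicts", "Least recently used", "Least frequently used"],
          ["Good for", "Temporal locality", "Stable hot sets"],
        ]}
      />
    </Sequence>
  </AbsoluteFill>
);
```

Because the "video" at this stage is just TSX, verification and refinement look like ordinary code review and diffs rather than re-rolling a generation.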
Tech Stack
- Agent orchestration: LangGraph (with Gemini 2.5 Flash for planning, reasoning, and content structuring)
- Video code generation: Gemini 3.0 for Remotion code, plus Veo 3 for generative footage where needed
- Code-based rendering: Remotion (React-based video framework)
- Knowledge visualization engine: our own "Visual Box Engine", a library of parameterized educational animation components (flowcharts, comparisons, step-by-step sequences, system diagrams, etc.)
- Voice synthesis: Fish Audio (for natural narration)
- Rendering: Google Cloud (distributed video rendering with headless Chrome)
- Code execution sandbox: E2B (for safe, isolated code execution during generation and preview); we plan to move to our own sandbox because E2B often times out and is slow at bundling and rendering
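For a feel of the orchestration side, here is a stripped-down LangGraph.js sketch of the plan → codegen hand-off. Node names, state fields, and prompts are simplified placeholders, not our production graph:

```ts
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";

// Planning model; in practice the codegen step would use a stronger code model.
const planner = new ChatGoogleGenerativeAI({ model: "gemini-2.5-flash" });

const PipelineState = Annotation.Root({
  script: Annotation<string>(),       // raw course script / extracted document text
  scenePlan: Annotation<string>(),    // structured scene-by-scene plan
  remotionCode: Annotation<string>(), // generated Remotion + Visual Box TSX
});

export const pipeline = new StateGraph(PipelineState)
  .addNode("plan", async (state) => {
    const res = await planner.invoke(
      `Split this script into scenes, each mapped to one visualization pattern:\n${state.script}`
    );
    return { scenePlan: String(res.content) };
  })
  .addNode("codegen", async (state) => {
    const res = await planner.invoke(
      `Write a Remotion composition (TSX) implementing this scene plan:\n${state.scenePlan}`
    );
    return { remotionCode: String(res.content) };
  })
  .addEdge(START, "plan")
  .addEdge("plan", "codegen")
  .addEdge("codegen", END)
  .compile();

// Usage: const result = await pipeline.invoke({ script: "..." });
```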
Why Remotion + Custom Components?
We chose Remotion because:
1. Editability: Every visual element is React code. Users (or our AI agents) can modify text, swap components, and adjust timing without touching raw video files.
2. Reproducibility: Same input → same output. No model randomness in the final render.
3. Composability: We built a "Visual Box" library of reusable animation patterns for education (e.g., "cause-and-effect flow," "comparison table," "hierarchical breakdown"). These aren't generic motion graphics; they're designed around pedagogical principles.
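As a concrete illustration of points 1 and 2: the generated composition is registered like any other Remotion project, so "editing the video" means editing props or TSX, and re-rendering the same inputs produces identical frames. The IDs and timing values below are placeholders:

```tsx
import React from "react";
import { Composition } from "remotion";
import { LessonVideo } from "./LessonVideo"; // the generated composition sketched above

// Every frame is a pure function of (props, frame number), so the same inputs
// always render the same pixels; an agent can tweak duration, fps, or props
// here without touching any rendered footage.
export const RemotionRoot: React.FC = () => (
  <Composition
    id="LessonVideo"
    component={LessonVideo}
    durationInFrames={600}
    fps={30}
    width={1920}
    height={1080}
  />
);
```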
The trade-off: We sacrifice some "cinematic quality" for logical accuracy and user control. Right now, output can feel closer to "animated slides" than "documentary footage"—which is actually our biggest unsolved challenge (more on that below).
What We're Struggling With (and Planning to Fix)
1. Code Error Rate: Generating Remotion code via LLMs is powerful but error-prone (see the compile-check sketch after this list).
2. Limited Asset Handling: Right now, if a user wants to insert a custom image/GIF/video mid-generation, they need to upload → we process → regenerate. This breaks flow.
3. The "PPT Feel" Problem: This is the hardest one. Because we prioritize structure and editability, our videos can feel like "animated PowerPoint" rather than "produced content."
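On the code error rate, one simplified way to picture a mitigation (a sketch under assumptions, not exactly what we ship): compile-check every generated scene before it reaches the preview sandbox and feed errors back to the model. The generateScene callback is a hypothetical stand-in for the LLM call:

```ts
import { transform } from "esbuild";

// Generate → compile-check → retry loop for LLM-written Remotion scenes.
// esbuild only catches syntax-level problems; type and runtime errors are
// still caught later by the sandboxed preview render.
export async function generateValidScene(
  generateScene: (prompt: string, previousError?: string) => Promise<string>, // hypothetical LLM call
  prompt: string,
  maxAttempts = 3
): Promise<string> {
  let lastError: string | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = await generateScene(prompt, lastError);
    try {
      // Parse/compile the TSX; throws with a readable message on syntax errors.
      await transform(code, { loader: "tsx" });
      return code; // compiles cleanly, hand off to the preview sandbox
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`Scene generation failed after ${maxAttempts} attempts: ${lastError}`);
}
```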
To address the third problem, we're experimenting with:
- Hybrid rendering: use generative video (Veo) for transitions/B-roll, but keep Visual Boxes for core explanations
- Cinematic presets: camera movements, depth effects, color grading, applied as composable layers (sketched after this list)
- Motion design constraints: teaching our agent to follow motion design principles (easing curves, visual hierarchy, pacing)
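To make "applied as composable layers" concrete, here's a rough sketch of a cinematic preset as a Remotion wrapper: a slow push-in plus a vignette around whatever Visual Box it wraps. The preset name and values are illustrative:

```tsx
import React from "react";
import { AbsoluteFill, useCurrentFrame, useVideoConfig, interpolate, Easing } from "remotion";

// Wraps any scene in a slow camera push-in and a soft vignette, without
// touching the scene's own code.
export const CinematicPushIn: React.FC<{ children: React.ReactNode }> = ({ children }) => {
  const frame = useCurrentFrame();
  const { durationInFrames } = useVideoConfig();
  const scale = interpolate(frame, [0, durationInFrames], [1, 1.08], {
    easing: Easing.inOut(Easing.ease),
  });
  return (
    <AbsoluteFill>
      <AbsoluteFill style={{ transform: `scale(${scale})` }}>{children}</AbsoluteFill>
      {/* Color/grading layers compose the same way, stacked on top. */}
      <AbsoluteFill
        style={{ background: "radial-gradient(ellipse at center, transparent 60%, rgba(0,0,0,0.35) 100%)" }}
      />
    </AbsoluteFill>
  );
};

// Usage: <CinematicPushIn><CauseEffectFlow ... /></CinematicPushIn>
```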
Honest question for HN: Has anyone solved this trade-off between "programmatically editable" and "cinematic quality"? I'd love to hear how others have approached it (especially in contexts where correctness > vibes).
A few more details on the stack:
- Why Gemini over GPT-5/Claude 4.5 for agent orchestration: Gemini 3.0 is better at generating React code.
- Visual Box Engine specifics: ~300 parameterized animation templates. Each "box" is a React component with props like {concept, relationships, emphasis, timing}. Example: "CauseEffectFlow" takes an array of steps and auto-generates animated arrows + state transitions (rough sketch after this list).
- E2B sandboxing: We run Remotion preview renders in isolated environments. This prevents malicious/buggy code from affecting other users' jobs.
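For a rough sense of what a Visual Box looks like inside, here's a heavily simplified CauseEffectFlow-style component. The real one has a richer prop surface ({concept, relationships, emphasis, timing}) and proper layout/arrow logic; this sketch keeps only the steps and a per-step timing knob:

```tsx
import React from "react";
import { AbsoluteFill, useCurrentFrame, interpolate } from "remotion";

type CauseEffectFlowProps = {
  concept: string;
  steps: string[];
  framesPerStep?: number; // timing knob an agent can adjust without regenerating the scene
};

export const CauseEffectFlow: React.FC<CauseEffectFlowProps> = ({
  concept,
  steps,
  framesPerStep = 30,
}) => {
  const frame = useCurrentFrame();
  return (
    <AbsoluteFill style={{ justifyContent: "center", alignItems: "center", gap: 24 }}>
      <h1 style={{ fontSize: 48 }}>{concept}</h1>
      <div style={{ display: "flex", alignItems: "center", gap: 16 }}>
        {steps.map((step, i) => {
          // Each step (and the arrow before it) fades in on its own schedule.
          const opacity = interpolate(
            frame,
            [i * framesPerStep, i * framesPerStep + 15],
            [0, 1],
            { extrapolateLeft: "clamp", extrapolateRight: "clamp" }
          );
          return (
            <React.Fragment key={i}>
              {i > 0 && <span style={{ opacity, fontSize: 40 }}>→</span>}
              <div style={{ opacity, padding: 16, border: "2px solid #333", borderRadius: 8, fontSize: 28 }}>
                {step}
              </div>
            </React.Fragment>
          );
        })}
      </div>
    </AbsoluteFill>
  );
};
```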
Happy to answer questions about any part of the stack!