I’ve been spending the past month diving into AI video generation — not just using models, but trying to understand the actual constraints behind them. After prototyping a small Sora-style generator on my own, I started to notice a few deeper patterns about the industry that I wanted to share and get feedback on.
1. AI video tools aren’t limited by “models”
Most of the friction today isn’t about model quality:
region-locked access
invite-only rollouts
heavy watermarking
friction in basic usage
short duration limits
no multi-scene support
pricing that's opaque or unsuitable for small creators
The technology is improving fast — but the accessibility layer hasn’t caught up.
This is why the majority of creators (especially small merchants, indie filmmakers, TikTok sellers, UGC creators) still can’t practically adopt AI video at scale.
2. Multi-scene generation is the “real moat”
Most models can do a single beautiful 2–4 second shot.
But real use cases — ads, storytelling, product demos — need:
shot transitions
visual consistency
character identity retention
stable camera paths
narrative structure
The real challenge is not “make a clip” but “make a sequence”.
That’s where pipelines, not models, matter.
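To make “sequence” concrete: roughly the kind of structure I mean is a small graph of scenes and shots, where each shot carries the references it needs to stay consistent with its neighbours. A minimal sketch (the field names are illustrative, not tied to any particular model API):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Shot:
        prompt: str                            # shot-level prompt
        duration_s: float                      # target length in seconds
        camera: str = "static"                 # e.g. "slow dolly-in", "handheld"
        identity_ref: Optional[str] = None     # shared reference for character identity
        style_anchor: Optional[str] = None     # shared look/grade description
        prev_last_frame: Optional[str] = None  # path to the previous shot's last frame

    @dataclass
    class Scene:
        description: str
        shots: List[Shot] = field(default_factory=list)

    @dataclass
    class Sequence:
        scenes: List[Scene] = field(default_factory=list)

        def all_shots(self) -> List[Shot]:
            # Flatten the graph into generation order.
            return [shot for scene in self.scenes for shot in scene.shots]

Everything downstream (prompting, conditioning, stitching) operates on this structure rather than on isolated prompts, which is most of what I mean by “pipeline”.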
3. The real bottleneck is temporal coherence
From my experiments, the hardest problems aren’t fancy effects — they’re the boring ones:
slight drift in character identity
physics mismatch between shots
exposure shifts
motion jitter at boundaries
the model choosing a different “interpretation” each time
There’s no perfect solution yet. Some combination of:
prompt redistribution
style anchors
conditioning
intermediate frames
shot graphs
works “okay”, but there’s a huge open research space.
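To show what “works okay” looks like, here’s a rough sketch of two of those heuristics combined (illustrative only, not my exact pipeline; generate_shot stands in for whatever model call you have, and I’m assuming the returned clip exposes its frames): the same style anchor and identity description are pushed into every shot prompt, and each shot is conditioned on the last frame of the previous one.

    def build_shot_prompts(shot_prompts, style_anchor, identity_desc):
        # Prompt redistribution: the shared style/identity text is appended to
        # every shot prompt instead of living only in the first one.
        return [f"{p} {identity_desc} Style: {style_anchor}" for p in shot_prompts]

    def render_sequence(shot_prompts, style_anchor, identity_desc, generate_shot):
        prompts = build_shot_prompts(shot_prompts, style_anchor, identity_desc)
        clips, last_frame = [], None
        for prompt in prompts:
            # Conditioning: pass the previous shot's final frame into the next shot
            # so exposure, framing, and identity drift less at the boundary.
            clip = generate_shot(prompt, init_image=last_frame)
            last_frame = clip.frames[-1]
            clips.append(clip)
        return clips

This helps at shot boundaries; it does nothing about the model picking a different interpretation mid-sequence.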
4. Small creators care less about model elegance — more about “does it work for my product?”
This surprised me.
I talked to some merchants and small creators. What they wanted wasn’t:
“best model”
“highest fidelity”
“latest architecture”
They asked for:
no watermark
9:16 format
handheld product shots
consistent 20–25s video
don’t make me wait
just give me something I can post today
It’s a very different set of priorities than what model researchers focus on.
5. The infra is the unsung hero
Most public discussions focus on models, but from building my prototype I realized:
async queues
model switching
fallback logic
caching policies
GPU scheduling
latency constraints
matter far more for practical AI video creation than architecture diagrams.
Without good infra, even the best models feel unusable.
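To illustrate the shape of it, here’s a toy sketch of the queue + fallback loop (hypothetical backend objects with an async generate() method, not any real provider API; the worker count is a crude stand-in for GPU scheduling):

    import asyncio

    async def worker(queue, backends, results):
        while True:
            job = await queue.get()
            clip = None
            for backend in backends:  # model switching + fallback logic
                try:
                    clip = await asyncio.wait_for(
                        backend.generate(job["prompt"]),
                        timeout=job.get("latency_budget_s", 120),
                    )
                    break
                except Exception:     # timeout or backend error: try the next one
                    continue
            results.append((job["id"], clip))  # clip stays None if everything failed
            queue.task_done()

    async def run_jobs(jobs, backends, concurrency=2):
        queue, results = asyncio.Queue(), []
        for job in jobs:
            queue.put_nowait(job)
        workers = [asyncio.create_task(worker(queue, backends, results))
                   for _ in range(concurrency)]
        await queue.join()
        for w in workers:
            w.cancel()
        return results

The interesting decisions (which backend to prefer per job, how long to wait before falling back, what to cache) all live around this loop, not inside the models.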
A prototype I built while exploring these ideas
As a way to understand these bottlenecks more concretely, I built a small prototype called Saro2.ai — basically an experiment in:
10s cinematic clip generation
25s multi-scene “storyboard” generation
attempts at shot consistency
simple scene → shot graph
a multi-model backend with light scheduling
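The scene → shot step, roughly (illustrative code, not the actual planner; in practice an LLM or a template would do the splitting):

    def plan_shots(storyboard, total_s=25.0, max_shot_s=5.0):
        # Split the brief into beats; sentence-splitting is just a stand-in
        # for a smarter planner.
        beats = [b.strip() for b in storyboard.split(".") if b.strip()]
        n = max(len(beats), int(round(total_s / max_shot_s)))
        beats += ["continuation of the previous action"] * (n - len(beats))
        per_shot = total_s / n
        return [{"index": i, "prompt": beats[i], "duration_s": per_shot}
                for i in range(n)]

    shots = plan_shots("Hands unbox the product. Close-up of the label. "
                       "Product in use on a desk. Logo end card.")
    # -> 5 shots of 5 seconds each, in storyboard order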
It requires a login (to control compute use), but I’m mainly sharing it as an example of the things I’m testing, not trying to “launch a product”.
Here’s the link if anyone wants to see how it behaves: https://saro2.ai/
What I’m hoping to learn
If you’ve worked on:
temporal modeling
multi-scene pipelines
conditioning
generative video infra
shot consistency strategies
I’d love to hear your perspective.
Especially curious about:
what people think the real frontier is
what “must-solve” engineering problems exist before AI video is truly usable
whether multi-scene consistency is solvable with heuristics or requires new architectures
Happy to share more details about the pipeline or what didn’t work.
Thanks for reading — and I’d appreciate any thoughts from people working in (or following) this space.