I’ve been spending the past month diving into AI video generation — not just using models, but trying to understand the actual constraints behind them. After prototyping a small Sora-style generator on my own, I started to notice a few deeper patterns about the industry that I wanted to share and get feedback on.
1. AI video tools aren’t limited by “models”
Most of the friction today isn’t about model quality:
region-locked access
invite-only rollouts
heavy watermarking
friction in basic usage
short duration limits
no multi-scene support
pricing that's opaque or unsuitable for small creators
The technology is improving fast — but the accessibility layer hasn’t caught up.
This is why the majority of creators (especially small merchants, indie filmmakers, TikTok sellers, UGC creators) still can’t practically adopt AI video at scale.
2. Multi-scene generation is the “real moat”
Most models can do a single beautiful 2–4 second shot.
But real use cases — ads, storytelling, product demos — need:
shot transitions
visual consistency
character identity retention
stable camera paths
narrative structure
The real challenge is not “make a clip” but “make a sequence”.
That’s where pipelines, not models, matter.
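To make “sequence” concrete: roughly the kind of structure I mean is a small graph of scenes and shots, where each shot carries the references it needs to stay consistent with its neighbours. A minimal sketch (the field names are illustrative, not tied to any particular model API):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Shot:
        prompt: str                            # shot-level prompt
        duration_s: float                      # target length in seconds
        camera: str = "static"                 # e.g. "slow dolly-in", "handheld"
        identity_ref: Optional[str] = None     # shared reference for character identity
        style_anchor: Optional[str] = None     # shared look/grade description
        prev_last_frame: Optional[str] = None  # path to the previous shot's last frame

    @dataclass
    class Scene:
        description: str
        shots: List[Shot] = field(default_factory=list)

    @dataclass
    class Sequence:
        scenes: List[Scene] = field(default_factory=list)

        def all_shots(self) -> List[Shot]:
            # Flatten the graph into generation order.
            return [shot for scene in self.scenes for shot in scene.shots]

Everything downstream (prompting, conditioning, stitching) operates on this structure rather than on isolated prompts, which is most of what I mean by “pipeline”.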
3. The real bottleneck is temporal coherence
From my experiments, the hardest problems aren’t fancy effects — they’re the boring ones:
slight drift in character identity
physics mismatch between shots
exposure shifts
motion jitter at boundaries
the model choosing a different “interpretation” each time
There’s no perfect solution yet. Some combination of:
prompt redistribution
style anchors
conditioning
intermediate frames
shot graphs
works “okay”, but there’s a huge open research space.
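To show what “works okay” looks like, here’s a rough sketch of two of those heuristics combined (illustrative only, not my exact pipeline; generate_shot stands in for whatever model call you have, and I’m assuming the returned clip exposes its frames): the same style anchor and identity description are pushed into every shot prompt, and each shot is conditioned on the last frame of the previous one.

    def build_shot_prompts(shot_prompts, style_anchor, identity_desc):
        # Prompt redistribution: the shared style/identity text is appended to
        # every shot prompt instead of living only in the first one.
        return [f"{p} {identity_desc} Style: {style_anchor}" for p in shot_prompts]

    def render_sequence(shot_prompts, style_anchor, identity_desc, generate_shot):
        prompts = build_shot_prompts(shot_prompts, style_anchor, identity_desc)
        clips, last_frame = [], None
        for prompt in prompts:
            # Conditioning: pass the previous shot's final frame into the next shot
            # so exposure, framing, and identity drift less at the boundary.
            clip = generate_shot(prompt, init_image=last_frame)
            last_frame = clip.frames[-1]
            clips.append(clip)
        return clips

This helps at shot boundaries; it does nothing about the model picking a different interpretation mid-sequence.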
4. Small creators care less about model elegance — more about “does it work for my product?”
This surprised me.
I talked to some merchants and small creators. What they wanted wasn’t:
“best model”
“highest fidelity”
“latest architecture”
They asked for:
no watermark
9:16 format
handheld product shots
consistent 20–25s video
don’t make me wait
just give me something I can post today
It’s a very different set of priorities than what model researchers focus on.
5. The infra is the unsung hero
Most public discussions focus on models, but from building my prototype I realized:
async queues
model switching
fallback logic
caching policies
GPU scheduling
latency constraints
matter far more for practical AI video creation than architecture diagrams.
Without good infra, even the best models feel unusable.
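To illustrate the shape of it, here’s a toy sketch of the queue + fallback loop (hypothetical backend objects with an async generate() method, not any real provider API; the worker count is a crude stand-in for GPU scheduling):

    import asyncio

    async def worker(queue, backends, results):
        while True:
            job = await queue.get()
            clip = None
            for backend in backends:  # model switching + fallback logic
                try:
                    clip = await asyncio.wait_for(
                        backend.generate(job["prompt"]),
                        timeout=job.get("latency_budget_s", 120),
                    )
                    break
                except Exception:     # timeout or backend error: try the next one
                    continue
            results.append((job["id"], clip))  # clip stays None if everything failed
            queue.task_done()

    async def run_jobs(jobs, backends, concurrency=2):
        queue, results = asyncio.Queue(), []
        for job in jobs:
            queue.put_nowait(job)
        workers = [asyncio.create_task(worker(queue, backends, results))
                   for _ in range(concurrency)]
        await queue.join()
        for w in workers:
            w.cancel()
        return results

The interesting decisions (which backend to prefer per job, how long to wait before falling back, what to cache) all live around this loop, not inside the models.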
A prototype I built while exploring these ideas
As a way to understand these bottlenecks more concretely, I built a small prototype called Saro2.ai — basically an experiment in:
10s cinematic clip generation
25s multi-scene “storyboard” generation
attempts at shot consistency
simple scene → shot graph
a multi-model backend with light scheduling
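The scene → shot step, roughly (illustrative code, not the actual planner; in practice an LLM or a template would do the splitting):

    def plan_shots(storyboard, total_s=25.0, max_shot_s=5.0):
        # Split the brief into beats; sentence-splitting is just a stand-in
        # for a smarter planner.
        beats = [b.strip() for b in storyboard.split(".") if b.strip()]
        n = max(len(beats), int(round(total_s / max_shot_s)))
        beats += ["continuation of the previous action"] * (n - len(beats))
        per_shot = total_s / n
        return [{"index": i, "prompt": beats[i], "duration_s": per_shot}
                for i in range(n)]

    shots = plan_shots("Hands unbox the product. Close-up of the label. "
                       "Product in use on a desk. Logo end card.")
    # -> 5 shots of 5 seconds each, in storyboard order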
It requires a login (to control compute use), but I’m mainly sharing it as an example of the things I’m testing, not trying to “launch a product”.
Here’s the link if anyone wants to see how it behaves: https://saro2.ai/
What I’m hoping to learn
If you’ve worked on:
temporal modeling
multi-scene pipelines
conditioning
generative video infra
shot consistency strategies
I’d love to hear your perspective.
Especially curious about:
what people think the real frontier is
what “must-solve” engineering problems exist before AI video is truly usable
whether multi-scene consistency is solvable with heuristics or requires new architectures
Happy to share more details about the pipeline or what didn’t work.
Thanks for reading — and I’d appreciate any thoughts from people working in (or following) this space.