SANA-WM is a newly released open-source world model from NVLabs for generating 1-minute 720p videos with explicit 6-DoF camera control.
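For anyone unfamiliar with what "6-DoF camera control" means concretely: the camera pose is just three rotation angles plus three translations, usually packed into a 4x4 extrinsic matrix. A minimal sketch (this is the generic pose parameterization, not SANA-WM's actual conditioning interface, and the function name is made up):

```python
import numpy as np

def pose_from_6dof(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 camera pose matrix from 3 rotation angles (radians)
    and 3 translations -- the six degrees of freedom."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # combined rotation
    T[:3, 3] = [tx, ty, tz]    # translation
    return T

# Identity pose: no rotation, no translation
assert np.allclose(pose_from_6dof(0, 0, 0, 0, 0, 0), np.eye(4))
```

A model with explicit control takes a trajectory of these poses (one per frame) as conditioning, rather than inferring camera motion from the text prompt.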
What stands out is the efficiency target: a 2.6B-parameter model trained natively for minute-scale generation. The paper claims each 60-second clip can be generated on a single GPU, and that a distilled variant with NVFP4 quantization denoises a 60-second 720p clip in 34 seconds on an RTX 5090.
The paper attributes this to a hybrid linear attention design, dual-branch camera control, a two-stage long-video pipeline, and a pose annotation pipeline built from public videos.
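The linear attention part is presumably what makes minute-scale generation tractable: replacing the softmax kernel with a feature map lets you precompute key-value statistics once, so cost grows linearly in sequence length instead of quadratically. A generic sketch of the trick (not the paper's actual layer; the ReLU feature map `phi` is my assumption):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized attention: approximate exp(q.k) by phi(q).phi(k).
    (phi(K).T @ V) is a fixed-size summary computed once, so the
    per-query cost no longer depends on sequence length n."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                 # (d, d_v) summary, independent of n
    z = Qf @ Kf.sum(axis=0)       # per-query normalizer, shape (n,)
    return (Qf @ kv) / z[:, None]
```

With a minute of 720p video tokens, the n^2 term of standard attention is the bottleneck, which is why hybrid designs keep full attention only in a few places.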
What does everyone think of this compared to the large closed world models? And what does the explicit camera control actually buy over prompt-driven camera motion?
jaspanglia•50m ago
The most exciting part is that it’s open-source — innovation is going to compound fast.
rvz•38m ago
Given that that's where everything is going, why not get there faster by open-sourcing Seedance 2.0, Happyhouse, Veo 3, and all the others?