- they are now using a 27B MoE architecture (with two 14B experts, one for low-level and one for high-level detail), an approach usually only used for autoregressive LLMs rather than diffusion models
- the smaller 5B model supports up to 720p24 video and runs on 24 GB of VRAM, e.g. an RTX 4090, a consumer graphics card
- if their benchmarks are reliable, the model performance is SOTA even compared to closed source models
Seems like you can run it on 2 GPUs with 12 GB of VRAM each. At least, a breakdown on their GitHub page implied so.
- The 27B "MoE" are not the MoE commonly referred to in LLM world. It is not MoE on FFN layers. It simply means two different models used for different denoising timestep ranges (exactly the same as SDXL-Base / SDXL-Refiner). Calling it MoE is not technically wrong. But claiming "which were usually only used for autoregressive LLMs rather than diffusion models" is just wrong (not to mention HiDream I1 is a model actually incorporated MoE layers (in FFN layer) and is a diffusion model).
- The A14B models can run on 24 GiB of VRAM too, with CPU offloading and quantization.
- Yes, it is SotA even including some closed source models.
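To make the contrast concrete, here is a minimal sketch of that timestep-range routing in hypothetical Python. Every name here (the function, the experts, the `boundary` parameter) is illustrative, not Wan's actual API: the point is that routing happens on the timestep, with each "expert" being a full denoiser, unlike LLM MoE, where a router picks FFN experts per token inside each layer.

```python
# Hedged sketch of timestep-range "MoE" routing, as described above.
# All names are illustrative; this is not Wan 2.2's real interface.
def denoise(latents, scheduler, high_noise_expert, low_noise_expert,
            boundary: float = 0.9):
    """Run the denoising loop, switching experts by timestep range."""
    for t in scheduler.timesteps:  # e.g. 1000 -> 0
        # Early (noisy) steps go to one full model, late (refinement)
        # steps to the other -- routing on t, not per-token FFN gating.
        frac = t / scheduler.config.num_train_timesteps
        expert = high_noise_expert if frac >= boundary else low_noise_expert
        noise_pred = expert(latents, t)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```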
https://github.com/deepbeepmeep/Wan2GP
And the discord community: https://discord.gg/g7efUW9jGV
"Wan2GP" is AI video and images "for the GPU poor", get all this operating with as little as 6GB VRAM, Nvidia only.
There are a lot of people focused on performance, via various methods, just as there are a lot of people focused on non-performance issues, like fine-tunes that add aspects the models lack: terminology linking professional media terms to the model, pop-culture terminology the model does not know, accurate body posture during fights, dance, gymnastics, and sports, and less flashy but pragmatic actions like proper use of tableware, chopsticks, keyboards, and musical instruments. These are complex actions that stand out when done incorrectly or never shown at all. The model's knowledge is broad but has limits, which people are filling in.
By "GPU poor" they didn't mean GPU-less or a GPU from the previous decade. The readme states that only Nvidia is supported.
Also disappointing that I haven't seen anything targeting the new Ryzen AI chips that can take 96 GB, since they seem pretty capable. I'm not sure how much of the memory an M4 Pro on the Apple side can be utilized for this, but the typical machines seem to be 48 or 64 GB these days. A lot more bang for your buck than an Nvidia card, on paper?
But really, all the various video models want an 80+ GB VRAM card to run comfortably. The contortions the ComfyUI community goes through to get things running at a reasonable speed on the current, dinky-VRAM consumer cards are impressive.
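As one example of those contortions, here is a minimal sketch of running a Wan checkpoint under tight VRAM with Hugging Face diffusers, using model CPU offloading so only the active submodule sits on the GPU. The model id and generation settings below are assumptions based on the Wan 2.2 diffusers examples; check the model card for the checkpoint you actually use.

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Assumed model id; substitute whichever Wan 2.2 diffusers checkpoint
# you actually have (e.g. the 5B TI2V or the A14B T2V repo).
model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"

pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# Keep only the active submodule on the GPU; everything else stays in
# system RAM. Slower, but this is what makes 24 GB (or less) workable.
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="a cat surfing a wave at sunset",
    height=704, width=1280,   # assumed 720p-class resolution
    num_frames=121,           # assumed; ~5 s at 24 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "out.mp4", fps=24)
```

Quantizing the transformer (e.g. with bitsandbytes) on top of offloading shrinks the footprint further, at some quality and speed cost.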
Those were both image-to-video, and I then upscaled them to 4K. I made the images using Flux Dev Krea.
Took about 3-4 minutes per video to generate and another 2-3 to upscale. Images took 20-40s to generate.
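For reference, the still-image-then-animate half of that workflow can be sketched with diffusers' Wan image-to-video pipeline. The model id, resolution, and frame count below are assumptions drawn from the Wan 2.1 diffusers examples, not the poster's exact setup; the upscaling step is separate and not shown.

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Assumed checkpoint; the poster's exact model/settings aren't known.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # fit in consumer VRAM

# Start frame generated beforehand with a text-to-image model.
image = load_image("keyframe.png")

frames = pipe(
    image=image,
    prompt="the subject turns toward the camera, cinematic lighting",
    height=480, width=832,  # assumed 480p-class resolution
    num_frames=81,          # assumed; ~5 s at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "i2v.mp4", fps=16)
```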
But, I guess sometimes you use a plane to build a plane while the material is aligned to a particular plane.