If you've ever tried generating AI video and thought, "This looks cool, but it's not what I asked for," you're not alone!
Wan 2.2 was designed to address scaling and efficiency challenges in video generation. It is among the first open-source video generation models to leverage a Mixture-of-Experts (MoE) architecture, enabling more efficient training. Created by Alibaba Tongyi Lab, the model gives creators more control over how their ideas turn into motion – following prompts more closely, with smoother movement and cinematic-quality results – without drastically increasing compute requirements.
On Vast.ai, Wan 2.2 is available in two variants in our Model Library: a text-to-video (T2V) model and an image-to-video (I2V) model.
Both models share the same underlying architecture, which is where things get interesting...
Most diffusion-based video models use a single neural network to handle the entire denoising process. Wan 2.2's MoE design, on the other hand, splits the work between two specialized "experts": a high-noise expert that establishes the overall scene layout, composition, and motion during the early, noisier denoising steps, and a low-noise expert that refines detail and texture as the video comes into focus.
This separation helps reduce common issues with video generation like unstable camera motion and inconsistent frames. It also makes the process more efficient. Even though each model has approximately 27 billion parameters in total, only about 14 billion are active per inference step.
The models automatically switch between experts via signal-to-noise ratio (SNR) thresholds – a measure of how much meaningful visual information remains versus how much noise is still present at each stage of generation. By activating the right expert at the right moment, Wan 2.2 avoids unnecessary computation. This keeps performance much more efficient, comparable to smaller single-expert models.
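To make the routing idea concrete, here is a minimal, purely illustrative sketch of SNR-threshold expert switching. The function names, threshold value, and toy "experts" are hypothetical stand-ins rather than Wan 2.2's actual implementation; the sketch only shows how a scheduler could hand each denoising step to whichever expert matches the current noise level.

```python
# Illustrative sketch of SNR-based expert routing. All names and the
# threshold are hypothetical, not Wan 2.2's real code; the point is that
# only one expert runs per step, which is why roughly 14B of the ~27B
# total parameters are active at a time.

def pick_expert(snr: float, high_noise_expert, low_noise_expert, snr_threshold: float = 1.0):
    """Route a denoising step to one expert based on the current SNR.

    Early steps are noise-dominated (low SNR), so the high-noise expert
    shapes global layout and motion; later steps (high SNR) go to the
    low-noise expert, which refines fine detail and texture.
    """
    return high_noise_expert if snr < snr_threshold else low_noise_expert


if __name__ == "__main__":
    # Toy demo: SNR rises as denoising progresses, so early steps route to
    # one expert and later steps to the other.
    def high(latents, t):  # stand-in for the high-noise expert
        return f"step {t}: high-noise expert"

    def low(latents, t):   # stand-in for the low-noise expert
        return f"step {t}: low-noise expert"

    for step, snr in enumerate([0.1, 0.4, 0.9, 1.5, 4.0]):
        expert = pick_expert(snr, high, low)
        print(expert(None, step))
```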
In short, you get more natural motion dynamics, superior visual fidelity, reduced artifacts, and less wasted compute.
While their underlying architecture is the same, the two Wan 2.2 variants have different creative entry points.
The text-to-video (T2V) model uses text prompts to generate 5-second clips at 480P or 720P resolution, with granular control over lighting, composition, contrast, and color tone.
A few other capabilities of the T2V variant include:
Overall, the text-to-video model is well suited for storyboarding, concept visualization, marketing clips, and rapid prototyping of video ideas from plain language.
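To show what that can look like in practice, here is a rough sketch of a scripted T2V run. It assumes the Hugging Face diffusers integration (WanPipeline) and a Diffusers-format Wan 2.2 checkpoint; the repo id, frame count, and sampling settings below are illustrative assumptions, so check the model card for exact values.

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Assumed repo id for a Diffusers-format Wan 2.2 T2V checkpoint; confirm on Hugging Face.
model_id = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"

pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Roughly a 5-second 720P clip: 81 frames, exported at an assumed 16 fps.
video = pipe(
    prompt=(
        "A slow dolly shot through a rain-soaked neon alley at night, "
        "cinematic lighting, high contrast, teal and orange color grade"
    ),
    height=720,
    width=1280,
    num_frames=81,
    guidance_scale=5.0,  # illustrative default
).frames[0]

export_to_video(video, "t2v_sample.mp4", fps=16)
```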
The image-to-video (I2V) model starts from a static input image and turns it into a short video sequence. It outputs at 480P or 720P, and optionally accepts a text prompt to guide the content and motion in detail.
Some strengths of the I2V variant include:
The image-to-video model is especially useful for turning concept art, product photos, and illustrations into animated content.
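A scripted I2V run looks much the same, except the pipeline takes a starting image alongside the optional prompt. As above, this is a hedged sketch assuming the diffusers WanImageToVideoPipeline and an assumed repo id; consult the model card for the exact parameters.

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Assumed repo id for a Diffusers-format Wan 2.2 I2V checkpoint; confirm on Hugging Face.
model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Start from a still image, e.g. a product photo or a piece of concept art.
image = load_image("product_photo.png")

# 480P output; 832x480 is a common Wan aspect ratio at this resolution.
video = pipe(
    image=image,
    prompt="The camera slowly orbits the product while soft studio light sweeps across it",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,  # illustrative default
).frames[0]

export_to_video(video, "i2v_sample.mp4", fps=16)
```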
Whether you start from a text prompt or from an image, Wan 2.2 is built for real-world creative and production workflows.
You don't even need enterprise-grade hardware to use the model. Wan 2.2's MoE architecture is efficient enough that you can achieve high-quality results on a consumer GPU like the RTX 4090. For larger workloads, multi-GPU inference is also an option: the model supports FSDP and DeepSpeed Ulysses for accelerated generation.
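As a rough illustration of fitting the model on a single consumer card, the sketch below uses diffusers-style CPU offloading to trade some speed for VRAM headroom. It relies on the same assumptions as the earlier examples (the WanPipeline integration and an assumed repo id); for multi-GPU runs with FSDP and DeepSpeed Ulysses, refer to the official Wan 2.2 repository's generation scripts and documentation.

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Assumed repo id; confirm on Hugging Face.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers", torch_dtype=torch.bfloat16
)

# Offload idle submodules to CPU between forward passes so peak VRAM stays
# within reach of a 24 GB card like the RTX 4090, at the cost of some speed.
pipe.enable_model_cpu_offload()

video = pipe(
    prompt="A paper boat drifting along a rain-filled gutter, macro lens, shallow depth of field",
    height=480,
    width=832,
    num_frames=81,
).frames[0]

export_to_video(video, "offloaded_sample.mp4", fps=16)
```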
Wan 2.2 pairs particularly well with Vast.ai's flexible GPU infrastructure. On Vast, you can spin up exactly the hardware you need and run the model on a single GPU, or scale across multiple GPUs for faster generation as your video projects grow.
Both the T2V and I2V models are compatible with ComfyUI workflows, making them easy to integrate into tools you may already be using – and ComfyUI is available directly on Vast.ai as well.
Ready to start creating with Wan 2.2? Launch it on Vast.ai in just a few clicks and start generating cinematic video today!