Wouldn't it be great if you could just think of a scene and instantly turn it into a video the way it appears in your head? Technology isn't quite there yet, but today’s most advanced AI video generation models are getting us closer.
Today we’re taking a look at WAN 2.2 and LTX-2, two open-source/open-weights models that transform text and images into short-form video.
From the outside, WAN 2.2 and LTX-2 are pretty similar tools. They’re both open-source/open-weights diffusion-based video generation models designed to turn images or text prompts into short video clips. Their underlying architectures, however, are very different.
Developed by Alibaba Tongyi Lab, WAN 2.2 is built around a Mixture-of-Experts (MoE) architecture. Instead of using a single neural network to manage the entire denoising process, it employs two specialized “experts”: a high-noise expert for overall structure and layout, and a low-noise expert for refining textures and details like lighting and color tone.
Switching between these two experts means the model can allocate compute where it matters at each stage of generation, focusing on broad structure first and finer details later. It also keeps inference cost in check: only one expert is active at any given denoising step, so the extra parameters don't translate into extra per-step compute.
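To make that hand-off concrete, here is a minimal, purely illustrative sketch of a two-expert denoising loop. The expert callables, the scheduler, and the switch threshold are stand-ins chosen for explanation; they are not WAN 2.2's actual interface or its real switching rule.

```python
# Illustrative only: the expert callables, scheduler, and switch_ratio below
# are stand-ins used to explain the idea, not WAN 2.2's actual interface or
# its real switching rule.
def moe_denoise(latents, timesteps, high_noise_expert, low_noise_expert,
                scheduler, switch_ratio=0.5):
    """Two-expert denoising loop: the high-noise expert shapes overall
    structure and motion early on, then hands off to the low-noise expert
    for textures, lighting, and color tone."""
    switch_step = int(len(timesteps) * switch_ratio)
    for i, t in enumerate(timesteps):
        # Only one expert runs per step, so per-step compute stays flat.
        expert = high_noise_expert if i < switch_step else low_noise_expert
        noise_pred = expert(latents, t)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```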
WAN 2.2 comes in three main variants, each designed for different workflows:

- WAN 2.2 T2V-A14B: the text-to-video model
- WAN 2.2 I2V-A14B: the image-to-video model
- WAN 2.2 TI2V-5B: a smaller hybrid model that handles both text-to-video and image-to-video and is light enough to run on a single consumer GPU
The base WAN 2.2 models generate video only, without native audio output. However, there is a specialized speech-to-video version (WAN 2.2 S2V) that transforms static images and audio inputs into synchronized videos.
Created by Lightricks, LTX-2 is a DiT-based (Diffusion Transformer) audio-video generation model. It produces audio and visuals together in one pass, keeping dialogue, lip movements, and ambient sound in sync.
Its architecture is based on latent diffusion: the model denoises a compressed latent representation of the video and only decodes it to full resolution at the end. This keeps memory use down and speeds up iteration, which translates to quicker experimentation and lower hardware overhead.
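As a rough illustration of that pattern, the sketch below denoises a small latent tensor and only decodes to pixels at the end. The vae, denoiser, and scheduler objects are generic placeholders rather than LTX-2's real components.

```python
# Generic latent-diffusion sketch: the vae, denoiser, and scheduler objects
# are placeholders, not LTX-2's actual components.
def generate_in_latent_space(prompt_embeds, noise, vae, denoiser, scheduler):
    """Denoise a compressed latent tensor, then decode to pixel frames.
    The latents are far smaller than the final video, which is where the
    memory savings and faster iteration come from."""
    latents = noise
    for t in scheduler.timesteps:
        noise_pred = denoiser(latents, t, prompt_embeds)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return vae.decode(latents)  # expand latents into full-resolution frames
```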
LTX-2 can generate up to ~20 seconds of synchronized audio and video, with support for high resolutions and high frame rates depending on configuration and available compute. The model offers fine-grained control options—such as LoRA-based customization and multimodal inputs including text, image, video, and audio—for precise creative direction.
This makes LTX-2 highly flexible: it supports text-to-video, image-to-video, and native audio-visual generation, along with cross-modal workflows like audio-to-video, text-to-audio, and video-to-audio, all within a single model.
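To picture how a single model can cover all of these modes, here is a hypothetical routing sketch. The generate() helper, model.sample(), and the parameter names are invented for illustration and are not LTX-2's actual API; the point is simply that optional conditioning inputs select the workflow.

```python
# Hypothetical routing sketch: generate(), model.sample(), and the parameter
# names are invented for illustration and are not LTX-2's real API.
def generate(model, prompt=None, image=None, video=None, audio=None,
             seconds=8, fps=24):
    """Select a workflow by passing whichever conditioning inputs you have;
    the same model handles every combination."""
    conditions = {name: value for name, value in
                  {"image": image, "video": video, "audio": audio}.items()
                  if value is not None}
    return model.sample(prompt=prompt, conditions=conditions,
                        seconds=seconds, fps=fps)

# Text-to-video:  generate(model, prompt="a storm rolling over a harbor")
# Image-to-video: generate(model, prompt="slow push-in", image=still_frame)
# Audio-to-video: generate(model, audio=dialogue_track)
```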
How the two models are designed directly affects what you experience as a user.
For instance, WAN 2.2's MoE design prioritizes structured generation and motion consistency. It boasts strong prompt adherence with high-fidelity output and is more likely to preserve scene intent across frames, sticking closely to what you asked for—albeit at the cost of slightly longer generation times.
LTX-2's latent diffusion approach emphasizes speed and accessibility. It’s faster to iterate with, easier to experiment on, and even offers native audio-video sync. However, it may require more prompt tuning to get exactly what you want.
Both models also integrate with ComfyUI, so you can jump right into testing them out with an intuitive node-based visual workflow.
Neither WAN 2.2 nor LTX-2 is objectively superior to the other. The two open-source/open-weights models are designed for different kinds of workflows and creative goals. The best way to get a feel for them is to actually try them out.
The good news is that both models run well on high-end consumer GPUs, making them far more accessible than many people might expect. With Vast.ai, it’s even easier: you can spin up the right hardware on demand and experiment on your own terms, paying only for the compute you need and saving up to 80% compared with traditional clouds.
Try WAN 2.2 T2V and WAN 2.2 I2V, or LTX-2 (or both!) in our Model Library, and build your own creative pipeline on Vast.ai today.