LTX-2

LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio in a single unified model.

Details

  • Modalities: video
  • Recommended Hardware: 1x RTX Pro 6000 WS
  • Provider: Lightricks
  • Family: LTXV
  • License: LTX-2 Community License Agreement

LTX-2 is a DiT-based (Diffusion Transformer) audio-video foundation model developed by Lightricks that generates synchronized video and audio within a single unified model. With 19 billion parameters, it represents a significant advancement in multimodal generation, enabling practical video creation with accompanying audio from various input modalities.

Key Features

LTX-2 supports multiple generation modes within a single architecture:

  • Text-to-Video: Generate video content directly from text descriptions
  • Image-to-Video: Animate static images into dynamic video sequences
  • Audio-Visual Generation: Create synchronized audio and video output together
  • Cross-Modal Generation: Support for audio-to-video, text-to-audio, and video-to-audio workflows

The unified architecture allows all these capabilities to work together seamlessly, making it possible to generate complete audiovisual content from simple prompts.

Architecture

LTX-2 is built on a Diffusion Transformer (DiT) architecture, combining the strengths of diffusion models with transformer-based processing. This design enables the model to handle both video and audio generation within a single framework, maintaining temporal coherence across both modalities.

The model expects width and height divisible by 32, and a frame count that is a multiple of 8 plus 1 (9, 17, 25, ...), allowing for flexible output configurations while maintaining generation quality.
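
As a concrete illustration of these constraints, here is a small helper that snaps a requested shape to the nearest valid one; the function is our own sketch, not part of any LTX-2 API:

    def snap_resolution(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
        """Round a requested output shape to one LTX-2 accepts:
        width and height divisible by 32, frame count of the form 8k + 1."""
        snapped_w = max(32, round(width / 32) * 32)
        snapped_h = max(32, round(height / 32) * 32)
        # Frame counts must be one more than a multiple of 8 (9, 17, 25, ...).
        snapped_f = max(9, round((num_frames - 1) / 8) * 8 + 1)
        return snapped_w, snapped_h, snapped_f

    # Example: a 720p request at 120 frames becomes a valid configuration.
    print(snap_resolution(1280, 720, 120))  # (1280, 704, 121)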

Training and Customization

The base model is fully trainable, supporting various customization approaches:

  • LoRA Training: Create Low-Rank Adaptations for specific styles or subjects
  • IC-LoRA: Image-Conditioned LoRAs for more precise control
  • Motion Adaptation: Train custom motion patterns efficiently
  • Style Transfer: Adapt the model to specific visual styles
  • Likeness Training: Capture both appearance and sound characteristics

These customization options enable users to adapt LTX-2 for specific creative applications while building on its foundation capabilities.
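
To make the LoRA option concrete, the sketch below shows what an adapter configuration could look like with the Hugging Face peft library; the rank and target module names are illustrative assumptions, not values taken from the LTX-2 codebase:

    from peft import LoraConfig

    lora_config = LoraConfig(
        r=64,           # rank of the low-rank update matrices
        lora_alpha=64,  # scaling applied to the learned update
        # Assumed attention-projection names; the real LTX-2 transformer may differ.
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    )

Lower ranks yield smaller adapters, while higher ranks give the adapter more capacity for styles or subjects that differ strongly from the base model.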

Use Cases

LTX-2 is designed for creative video generation applications including:

  • Short-form video content creation
  • Animation and motion design
  • Visual storytelling with synchronized audio
  • Creative experimentation with multimodal generation
  • Prototyping video concepts from text descriptions

Prompting

Effective prompting significantly impacts generation quality. The model responds well to detailed, descriptive prompts that clearly articulate the desired visual and audio elements. For best results, users should provide specific details about motion, scene composition, and audio characteristics when generating audiovisual content.
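
An illustrative prompt in this spirit (our own example, not from Lightricks' documentation) might read:

    A close-up of rain hitting a tin roof at dusk; the camera slowly pulls
    back to reveal a quiet street, while the soundtrack carries steady
    rainfall and a distant rumble of thunder.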

Integration

LTX-2 integrates with ComfyUI through built-in LTXVideo nodes, enabling visual workflow-based generation. The model is also supported in the Hugging Face Diffusers library for programmatic access.
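
As a sketch of programmatic access, the snippet below uses the LTXPipeline class that Diffusers ships for the LTX family; the checkpoint ID shown is the earlier LTX-Video repository, so verify the correct pipeline class and repository name for LTX-2 on the model page before relying on it:

    import torch
    from diffusers import LTXPipeline
    from diffusers.utils import export_to_video

    # Checkpoint ID is an assumption; check the model page for the LTX-2 repo name.
    pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    video = pipe(
        prompt="A slow pan across a foggy harbor at sunrise, gulls circling overhead",
        width=1280,
        height=704,      # divisible by 32
        num_frames=121,  # a multiple of 8, plus 1
    ).frames[0]

    export_to_video(video, "harbor.mp4", fps=24)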

For more details about the model architecture and training approach, see the model page on Hugging Face.

Quick Start Guide

  1. Choose a model and click 'Deploy' above to find available GPUs recommended for this model.
  2. Rent your dedicated instance, preconfigured with the model you've selected.
  3. Start sending requests to your model instance and get responses right away.
