LTX-2.3 is a Diffusion Transformer (DiT)-based audio-video foundation model developed by Lightricks. It updates LTX-2 with improved audio and visual quality and stronger prompt adherence. The model generates synchronized video and audio within a single unified architecture, enabling practical multimodal content creation from a range of input combinations.
Key Features
LTX-2.3 supports a broad range of generation modes within its unified architecture:
- Text-to-Video: Generate video content directly from text descriptions
- Image-to-Video: Animate static images into dynamic video sequences
- Video-to-Video: Transform existing video with style or content modifications
- Audio-Visual Generation: Create synchronized audio and video output together
- Cross-Modal Generation: Support for audio-to-video, text-to-audio, and audio-to-audio workflows
The model includes a multi-stage pipeline with spatial upscalers (1.5x and 2x) and a temporal upscaler (2x) for producing higher-resolution output and smoother frame rates.
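As a rough sketch of how the upscaling stages compose, the function below applies one spatial and one temporal stage to a base generation. The base resolution, frame rate, and stage ordering are illustrative assumptions, not documented defaults; only the scale factors (1.5x/2x spatial, 2x temporal) come from the description above.

```python
def apply_pipeline(width, height, fps, spatial_scale=2.0, temporal_scale=2.0):
    """Apply one spatial upscaling stage and one temporal upscaling stage.

    A hypothetical helper for reasoning about output size; not part of any
    LTX codebase.
    """
    return (int(width * spatial_scale),
            int(height * spatial_scale),
            int(fps * temporal_scale))

# e.g. an assumed 768x512 base generation at 24 fps, passed through the
# 2x spatial and 2x temporal stages:
w, h, fps = apply_pipeline(768, 512, 24)
print(w, h, fps)  # 1536 1024 48
```

Using the 1.5x spatial upscaler instead (`spatial_scale=1.5`) would yield 1152x768 from the same base.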
Architecture
LTX-2.3 is built on a Diffusion Transformer (DiT) architecture that combines diffusion models with transformer-based processing. This design handles both video and audio generation within a single framework while maintaining temporal coherence across both modalities.
The model processes video with width and height divisible by 32, and frame counts of the form 8k + 1 (e.g. 9, 17, 121), allowing for flexible output configurations. A distilled variant enables faster generation in as few as 8 steps with a classifier-free guidance scale of 1.
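The dimension constraints above can be checked or enforced with a small helper. The function names are ours, not part of the LTX codebase; only the divisibility rules come from the text.

```python
def snap_dimensions(width, height, num_frames):
    """Round requested dimensions down to the nearest valid values:
    width/height to a multiple of 32, frame count to the form 8k + 1."""
    width = (width // 32) * 32
    height = (height // 32) * 32
    num_frames = ((num_frames - 1) // 8) * 8 + 1
    return width, height, num_frames

def is_valid(width, height, num_frames):
    """True if the dimensions satisfy the model's constraints."""
    return (width % 32 == 0
            and height % 32 == 0
            and (num_frames - 1) % 8 == 0)

# e.g. a 1280x720 request for 120 frames snaps to 1280x704, 113 frames:
print(snap_dimensions(1280, 720, 120))  # (1280, 704, 113)
```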
Training and Customization
The base model (dev variant) is fully trainable, supporting various customization approaches:
- LoRA Training: Create Low-Rank Adaptations for specific styles or subjects
- IC-LoRA: Image-Conditioned LoRAs for more precise control
- Motion Adaptation: Train custom motion patterns efficiently
- Style Transfer: Adapt the model to specific visual styles
- Likeness Training: Capture both appearance and sound characteristics
Training for motion, style, or likeness customization can be completed in under one hour in many configurations.
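One reason LoRA training is fast is that it touches only a small fraction of the model's weights: instead of fine-tuning a full d_out x d_in matrix W, it trains two low-rank factors B (d_out x r) and A (r x d_in) and applies W' = W + (alpha / r) * B A. The arithmetic below is the standard LoRA formulation, not an LTX-specific configuration; the layer size and rank are illustrative.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in a rank-r LoRA adapter: B is d_out x r,
    A is r x d_in."""
    return r * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters touched by a full fine-tune of the same layer."""
    return d_in * d_out

# e.g. a 4096x4096 projection adapted at rank 16 trains under 1% of the
# parameters a full fine-tune would update:
ratio = lora_param_count(4096, 4096, 16) / full_param_count(4096, 4096)
print(f"{ratio:.4f}")  # 0.0078
```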
Use Cases
LTX-2.3 is designed for creative video generation applications including:
- Short-form video content creation
- Animation and motion design
- Visual storytelling with synchronized audio
- Creative experimentation with multimodal generation
- Prototyping video concepts from text descriptions
- Video transformation and style transfer
Prompting
Effective prompting significantly impacts generation quality. The model responds well to detailed, descriptive prompts that clearly articulate the desired visual and audio elements. For best results, provide specific details about motion, scene composition, and audio characteristics when generating audiovisual content.
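One way to follow this guidance is to compose the prompt from explicit scene, motion, and audio components. The structure and wording below are our own illustration, not a required or documented prompt format.

```python
# Illustrative audiovisual prompt assembled from the three kinds of detail
# recommended above: scene composition, motion, and audio characteristics.
scene = ("A lighthouse on a rocky coast at dusk, "
         "warm lamplight sweeping across dark water")
motion = "slow aerial push-in as waves crash against the rocks"
audio = "distant foghorn, rhythmic surf, wind rising and falling"

prompt = f"{scene}. Camera: {motion}. Audio: {audio}."
print(prompt)
```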
Integration
LTX-2.3 integrates with ComfyUI through built-in LTXVideo nodes, enabling visual workflow-based generation. The model is also available through the LTX-2 PyTorch codebase for programmatic access, with Diffusers support planned.
For more details about the model architecture and capabilities, see the model page on Hugging Face.