LTX-2.3 is a Diffusion Transformer (DiT)-based audio-video foundation model developed by Lightricks. It updates LTX-2 with improved audio and visual quality and stronger prompt adherence. The model generates synchronized video and audio within a single unified architecture, enabling practical multimodal content creation from a range of input combinations.
Key Features
LTX-2.3 supports a broad range of generation modes within its unified architecture:
- Text-to-Video: Generate video content directly from text descriptions
- Image-to-Video: Animate static images into dynamic video sequences
- Video-to-Video: Transform existing video with style or content modifications
- Audio-Visual Generation: Create synchronized audio and video output together
- Cross-Modal Generation: Support for audio-to-video, text-to-audio, and audio-to-audio workflows
The model includes a multi-stage pipeline with spatial upscalers (1.5x and 2x) and a temporal upscaler (2x) for producing higher-resolution output and smoother frame rates.
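As a rough sketch of how the upscaling stages compose, the function below applies one spatial and one temporal stage to a base generation. The base resolution, frame rate, and stage ordering are illustrative assumptions, not documented defaults; only the scale factors (1.5x/2x spatial, 2x temporal) come from the description above.

```python
def apply_pipeline(width, height, fps, spatial_scale=2.0, temporal_scale=2.0):
    """Apply one spatial upscaling stage and one temporal upscaling stage.

    A hypothetical helper for reasoning about output size; not part of any
    LTX codebase.
    """
    return (int(width * spatial_scale),
            int(height * spatial_scale),
            int(fps * temporal_scale))

# e.g. an assumed 768x512 base generation at 24 fps, passed through the
# 2x spatial and 2x temporal stages:
w, h, fps = apply_pipeline(768, 512, 24)
print(w, h, fps)  # 1536 1024 48
```

Using the 1.5x spatial upscaler instead (`spatial_scale=1.5`) would yield 1152x768 from the same base.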
Architecture
LTX-2.3 is built on a Diffusion Transformer (DiT) architecture that combines diffusion models with transformer-based processing. This design handles both video and audio generation within a single framework while maintaining temporal coherence across both modalities.
The model processes video with width and height divisible by 32, and frame counts of the form 8k + 1 (e.g. 9, 17, 121), allowing for flexible output configurations. A distilled variant enables faster generation in as few as 8 steps with a classifier-free guidance scale of 1.
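The dimension constraints above can be checked or enforced with a small helper. The function names are ours, not part of the LTX codebase; only the divisibility rules come from the text.

```python
def snap_dimensions(width, height, num_frames):
    """Round requested dimensions down to the nearest valid values:
    width/height to a multiple of 32, frame count to the form 8k + 1."""
    width = (width // 32) * 32
    height = (height // 32) * 32
    num_frames = ((num_frames - 1) // 8) * 8 + 1
    return width, height, num_frames

def is_valid(width, height, num_frames):
    """True if the dimensions satisfy the model's constraints."""
    return (width % 32 == 0
            and height % 32 == 0
            and (num_frames - 1) % 8 == 0)

# e.g. a 1280x720 request for 120 frames snaps to 1280x704, 113 frames:
print(snap_dimensions(1280, 720, 120))  # (1280, 704, 113)
```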
Training and Customization
The base model (dev variant) is fully trainable, supporting various customization approaches:
- LoRA Training: Create Low-Rank Adaptations for specific styles or subjects
- IC-LoRA: Image-Conditioned LoRAs for more precise control
- Motion Adaptation: Train custom motion patterns efficiently
- Style Transfer: Adapt the model to specific visual styles
- Likeness Training: Capture both appearance and sound characteristics
Training for motion, style, or likeness customization can be completed in under one hour in many configurations.
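One reason LoRA training is fast is that it touches only a small fraction of the model's weights: instead of fine-tuning a full d_out x d_in matrix W, it trains two low-rank factors B (d_out x r) and A (r x d_in) and applies W' = W + (alpha / r) * B A. The arithmetic below is the standard LoRA formulation, not an LTX-specific configuration; the layer size and rank are illustrative.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in a rank-r LoRA adapter: B is d_out x r,
    A is r x d_in."""
    return r * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters touched by a full fine-tune of the same layer."""
    return d_in * d_out

# e.g. a 4096x4096 projection adapted at rank 16 trains under 1% of the
# parameters a full fine-tune would update:
ratio = lora_param_count(4096, 4096, 16) / full_param_count(4096, 4096)
print(f"{ratio:.4f}")  # 0.0078
```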
Use Cases
LTX-2.3 is designed for creative video generation applications including:
- Short-form video content creation
- Animation and motion design
- Visual storytelling with synchronized audio
- Creative experimentation with multimodal generation
- Prototyping video concepts from text descriptions
- Video transformation and style transfer
Prompting
Effective prompting significantly impacts generation quality. The model responds well to detailed, descriptive prompts that clearly articulate the desired visual and audio elements. For best results, provide specific details about motion, scene composition, and audio characteristics when generating audiovisual content.
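One way to follow this guidance is to compose the prompt from explicit scene, motion, and audio components. The structure and wording below are our own illustration, not a required or documented prompt format.

```python
# Illustrative audiovisual prompt assembled from the three kinds of detail
# recommended above: scene composition, motion, and audio characteristics.
scene = ("A lighthouse on a rocky coast at dusk, "
         "warm lamplight sweeping across dark water")
motion = "slow aerial push-in as waves crash against the rocks"
audio = "distant foghorn, rhythmic surf, wind rising and falling"

prompt = f"{scene}. Camera: {motion}. Audio: {audio}."
print(prompt)
```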
Integration
LTX-2.3 integrates with ComfyUI through built-in LTXVideo nodes, enabling visual workflow-based generation. The model is also available through the LTX-2 PyTorch codebase for programmatic access, with Diffusers support planned.
For more details about the model architecture and capabilities, see the model page on Hugging Face.