LTX-2: Efficient Joint Audio-Visual Foundation Model
About
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
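The abstract names modality-aware classifier-free guidance (modality-CFG) but does not give its formula. As an illustration only, the sketch below shows one way a guidance rule could use separate scales for the text condition and the cross-modal condition. The function name `modality_cfg`, its arguments, and the combination rule itself are assumptions made for this example and are not taken from the LTX-2 release. The rule is chosen so that setting the cross-modal scale to 1 recovers standard classifier-free guidance on the text prompt.

```python
from typing import List


def modality_cfg(
    full: List[float],
    no_text: List[float],
    no_cross: List[float],
    text_scale: float,
    cross_scale: float,
) -> List[float]:
    """Hypothetical two-scale guidance rule (an assumption, not the paper's formula).

    full     -- denoiser prediction with text and cross-modal conditioning
    no_text  -- prediction with the text prompt dropped
    no_cross -- prediction with the cross-modal (audio<->video) signal dropped
    """
    return [
        # Push the fully conditioned prediction further along each
        # "condition minus dropped condition" direction.
        f + (text_scale - 1.0) * (f - nt) + (cross_scale - 1.0) * (f - nc)
        for f, nt, nc in zip(full, no_text, no_cross)
    ]


# With cross_scale = 1 the rule reduces to ordinary text CFG:
# no_text + text_scale * (full - no_text).
guided = modality_cfg([1.0, 2.0], [0.0, 1.0], [0.5, 0.5], 3.0, 1.0)
```

Keeping the two scales independent is the point of the sketch: prompt adherence and audio-video alignment can then be tuned separately rather than through a single guidance weight.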
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Generation | ImageNet 256x256 | IS | 42.57 | 517 |
| Video Reasoning | VBVR-Bench Out-of-Domain | Average Score | 0.297 | 39 |
| Video Generation | short videos 81-frames 240 prompts | Total Score | 6 | 38 |
| Image Reconstruction | ImageNet 256p | PSNR | 26.06 | 38 |
| Video Reasoning | VBVR-Bench In-Domain | Average Score | 32.9 | 35 |
| Image Reconstruction | OmniDoc-TokenBench 256x256 (test) | SSIM | 73.54 | 23 |
| Image Reconstruction | FFHQ 1k | PSNR | 33.63 | 21 |
| Video Reasoning | VBVR-Bench | Overall Accuracy | 31.3 | 18 |
| Joint audio-video generation | JavisBench | Audio-Video Consistency (AV-IB) | 23.2 | 12 |
| Joint text-to-audio-video generation | HDTF and Hallo3 English (test) | FID | 27.46 | 12 |