LTX-2: Efficient Joint Audio-Visual Foundation Model

About

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
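The abstract names a modality-aware classifier-free guidance (modality-CFG) mechanism but does not give its formula. As a rough illustration only, the sketch below assumes one plausible form: the fully conditioned prediction is pushed away from a text-dropped prediction and from a cross-modal-dropped prediction, each with its own scale. The function name `modality_cfg`, its parameters, and the combination rule are all hypothetical and are not taken from the LTX-2 code release.

```python
from typing import List, Sequence


def modality_cfg(
    full: Sequence[float],
    no_text: Sequence[float],
    no_cross: Sequence[float],
    text_scale: float,
    cross_scale: float,
) -> List[float]:
    """Combine three denoiser predictions for one modality stream.

    ASSUMPTION: this two-term guidance rule is an illustrative guess,
    not the formula used by LTX-2.

    full      -- prediction with text and cross-modal conditioning
    no_text   -- prediction with the text prompt dropped
    no_cross  -- prediction with the other modality's stream dropped
    """
    if not (len(full) == len(no_text) == len(no_cross)):
        raise ValueError("all predictions must have the same length")
    return [
        f + text_scale * (f - nt) + cross_scale * (f - nc)
        for f, nt, nc in zip(full, no_text, no_cross)
    ]


# Toy usage on a two-element "prediction" vector.
guided = modality_cfg(
    full=[1.0, 2.0],
    no_text=[0.5, 1.0],
    no_cross=[1.0, 1.5],
    text_scale=2.0,
    cross_scale=1.0,
)
print(guided)
```

With both scales set to zero the function returns the fully conditioned prediction unchanged, so under this assumed rule the two scales act as independent knobs for prompt adherence and audio-video alignment.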

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Image Generation	ImageNet 256x256	IS	42.57	606
Image-to-Video Generation	VBench	Motion Smoothness	0.9938	46
Robotic Video Generation	R-Bench	Average Score	38.1	44
Video Reasoning	VBVR-Bench Out-of-Domain	Average Score	0.297	39
Video Generation	short videos 81-frames 240 prompts	Total Score	6	38
Image Reconstruction	ImageNet 256p	PSNR	26.06	38
Video Reasoning	VBVR-Bench In-Domain	Average Score	32.9	35
Image Reconstruction	OmniDoc-TokenBench 256x256 (test)	SSIM	73.54	23
Image Reconstruction	FFHQ 1k	PSNR	33.63	21
Video Reasoning	VBVR-Bench	Overall Accuracy	31.3	18

Showing 10 of 57 rows
