MOVA: Towards Scalable and Synchronized Video-Audio Generation
About
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase supports efficient inference, LoRA fine-tuning, and prompt enhancement.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-centric Reasoning | VideoThinkBench mini (test) | Average Score | 12.5 | 22 |
| Vision-centric tasks | VideoThinkBench mini (test) | Average Score | 13.4 | 18 |
| Audio-visual generation | Verse-Bench (All subsets) | IS (Score) | 4.269 | 7 |
| Audio-visual generation | Verse-Bench (set3) | DNSMOS | 3.797 | 6 |
| Audio-visual generation | Verse-Bench multi-speaker | cpCER | 14.9 | 6 |
| Text-to-Audio | AudioBox | Clarity Score (CE) | 3.41 | 6 |
| Motion-conditioned Audio-Video Generation | Audio-Video Generation Evaluation Set | AS | 4.63 | 5 |
| Joint audio-video generation | (test) | CU Score | 6.34 | 4 |
| Joint audio-video generation | Easy (test) | Sync-C | 7.79 | 4 |
| Joint audio-video generation | HARD (test) | Sync-C | 5.38 | 4 |