Omni2Sound: Towards Unified Video-Text-to-Audio Generation
About
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight audio-visual-text (A-V-T) alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task.

First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) whose caption quality significantly surpasses existing datasets and even human experts. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate the visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5× cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight A-V-T alignment.

Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and faithful off-screen audio generation.

Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.
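As a rough illustration of how a single diffusion backbone can serve V2A, T2A, and VT2A, one common approach is to replace the absent modality with a null embedding so all three tasks share one conditioning interface. The sketch below is a hypothetical helper (the function name, tensor shapes, and the use of a zero tensor as the null token are assumptions for illustration, not the paper's actual implementation):

```python
import torch

def build_condition(video_emb: torch.Tensor, text_emb: torch.Tensor, task: str) -> torch.Tensor:
    """Assemble a conditioning tensor for one training sample.

    Hypothetical sketch: the unused modality is swapped for a null
    embedding (here a zero tensor; a learned null token is typical),
    so V2A, T2A, and VT2A all present the same interface to the model.
    `video_emb` and `text_emb` are precomputed encoder features of
    shape (batch, dim).
    """
    null_video = torch.zeros_like(video_emb)  # placeholder null token
    null_text = torch.zeros_like(text_emb)
    if task == "v2a":    # video-only conditioning
        return torch.cat([video_emb, null_text], dim=-1)
    if task == "t2a":    # text-only conditioning
        return torch.cat([null_video, text_emb], dim=-1)
    if task == "vt2a":   # joint video-text conditioning
        return torch.cat([video_emb, text_emb], dim=-1)
    raise ValueError(f"unknown task: {task}")
```

A multi-task training loop can then sample `task` per batch, which is one way the cross-task trade-off described above becomes a joint optimization over a shared condition space.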
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | VGGSound-Omni (test) | KL Divergence | 1.35 | 10 |
| Video-and-Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.1 | 5 |
| Video-to-Audio Generation | VGGSound-Omni (test) | KL Divergence | 2.04 | 5 |
| Audio Captioning | AudioSet | LA-CLAP | 0.447 | 4 |
| Audio Captioning | VGGSound | LA-CLAP | 0.461 | 3 |
| Audio Captioning | AudioCaps | MWR-S (MLLM) | 0.75 | 3 |
| Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.36 | 3 |
| Video-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.47 | 3 |
| Video-Text-to-Audio Generation (VT2A) | VGGSound-Omni (off-screen track) | FAD | 0.97 | 2 |
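The KL Divergence metric reported above is conventionally computed by running a pretrained audio tagger over generated and reference clips and comparing their label distributions. The sketch below shows that computation in a minimal form; the choice of classifier, the direction of the divergence, and the batch-averaging are assumptions of this sketch, not details confirmed by the table:

```python
import torch
import torch.nn.functional as F

def kl_metric(gen_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Average KL divergence between classifier label distributions.

    Illustrative sketch: a pretrained audio tagger scores generated and
    reference clips, producing (num_samples, num_classes) logits; the
    metric is KL(ref || gen) over class probabilities, averaged across
    samples ("batchmean").
    """
    log_p_gen = F.log_softmax(gen_logits, dim=-1)  # log-probs of generated audio
    p_ref = F.softmax(ref_logits, dim=-1)          # probs of reference audio
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_p_gen, p_ref, reduction="batchmean")
```

Lower is better: identical distributions give a KL of zero, so the scores above measure how closely each model's generated audio matches the reference audio in semantic content.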