Omni2Sound: Towards Unified Video-Text-to-Audio Generation
About
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight audio-visual-text (A-V-T) alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task.

First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) whose caption quality significantly surpasses existing datasets and even human experts. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate the visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5× cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight A-V-T alignment.

Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and faithful off-screen audio generation.

Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.
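As a rough illustration of how a single diffusion backbone can serve V2A, T2A, and VT2A, one common approach is to replace the absent modality with a null embedding so all three tasks share one conditioning interface. The sketch below is a hypothetical helper (the function name, tensor shapes, and the use of a zero tensor as the null token are assumptions for illustration, not the paper's actual implementation):

```python
import torch

def build_condition(video_emb: torch.Tensor, text_emb: torch.Tensor, task: str) -> torch.Tensor:
    """Assemble a conditioning tensor for one training sample.

    Hypothetical sketch: the unused modality is swapped for a null
    embedding (here a zero tensor; a learned null token is typical),
    so V2A, T2A, and VT2A all present the same interface to the model.
    `video_emb` and `text_emb` are precomputed encoder features of
    shape (batch, dim).
    """
    null_video = torch.zeros_like(video_emb)  # placeholder null token
    null_text = torch.zeros_like(text_emb)
    if task == "v2a":    # video-only conditioning
        return torch.cat([video_emb, null_text], dim=-1)
    if task == "t2a":    # text-only conditioning
        return torch.cat([null_video, text_emb], dim=-1)
    if task == "vt2a":   # joint video-text conditioning
        return torch.cat([video_emb, text_emb], dim=-1)
    raise ValueError(f"unknown task: {task}")
```

A multi-task training loop can then sample `task` per batch, which is one way the cross-task trade-off described above becomes a joint optimization over a shared condition space.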
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | VGGSound-Omni (test) | KL Divergence | 1.35 | 10 |
| Video-and-Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.1 | 5 |
| Video-to-Audio Generation | VGGSound-Omni (test) | KL Divergence | 2.04 | 5 |
| Audio Captioning | AudioSet | LA-CLAP | 0.447 | 4 |
| Audio Captioning | VGGSound | LA-CLAP | 0.461 | 3 |
| Audio Captioning | AudioCaps | MWR-S (MLLM) | 0.75 | 3 |
| Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.36 | 3 |
| Video-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.47 | 3 |
| Video-Text-to-Audio Generation (VT2A) | VGGSound-Omni (off-screen track) | FAD | 0.97 | 2 |
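The KL Divergence metric reported above is conventionally computed by running a pretrained audio tagger over generated and reference clips and comparing their label distributions. The sketch below shows that computation in a minimal form; the choice of classifier, the direction of the divergence, and the batch-averaging are assumptions of this sketch, not details confirmed by the table:

```python
import torch
import torch.nn.functional as F

def kl_metric(gen_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Average KL divergence between classifier label distributions.

    Illustrative sketch: a pretrained audio tagger scores generated and
    reference clips, producing (num_samples, num_classes) logits; the
    metric is KL(ref || gen) over class probabilities, averaged across
    samples ("batchmean").
    """
    log_p_gen = F.log_softmax(gen_logits, dim=-1)  # log-probs of generated audio
    p_ref = F.softmax(ref_logits, dim=-1)          # probs of reference audio
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_p_gen, p_ref, reduction="batchmean")
```

Lower is better: identical distributions give a KL of zero, so the scores above measure how closely each model's generated audio matches the reference audio in semantic content.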