MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
About
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
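The flow matching objective mentioned above trains the network to regress the velocity of a straight-line path from noise to data. The following is a minimal sketch of one such training step, not MMAudio's actual implementation; the `model` callable, argument shapes, and conditioning interface are all illustrative assumptions.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """One conditional flow-matching training step (illustrative sketch).

    x1   : target audio latents, shape (batch, dim)
    cond : conditioning features (video/text), passed through to the model
    model: hypothetical callable model(x_t, t, cond) -> predicted velocity
    """
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))  # random time in [0, 1] per sample
    x_t = (1.0 - t) * x0 + t * x1           # point on the linear noise->data path
    v_target = x1 - x0                      # constant velocity of that path
    v_pred = model(x_t, t, cond)            # network predicts the velocity
    return np.mean((v_pred - v_target) ** 2)  # MSE regression on velocity
```

At inference, audio is generated by integrating the learned velocity field from noise at `t = 0` to `t = 1` with an ODE solver, which is what allows few-step, low-latency sampling.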
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | AudioCaps (test) | FAD | 5.51 | 138 |
| Video-to-Audio Generation | VGGSound (test) | -- | -- | 62 |
| Joint audio-video generation | JavisBench 1.0 (test) | AV-IB | 0.198 | 18 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence | 1.63 | 10 |
| Video-to-Audio Generation | EchoFoley 6k | Temporal Control Score | 30 | 9 |
| Video-to-Audio | VGGSound (test) | APCC-Δ | 0.536 | 9 |
| Video-to-Audio Generation | LongVale | FD (VGG) | 6.41 | 8 |
| Video-to-Audio Generation | UnAV100 | FD (VGG) | 3.86 | 8 |
| Video-to-Audio Generation | Kling-Eval (test) | FD (PaSST) | 205.8 | 7 |
| Video-to-Audio Generation | VGGSound | FD (VGG) | 0.97 | 6 |