
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

About

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
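The flow matching objective mentioned above can be illustrated with a minimal sketch. This is not MMAudio's actual implementation (the real model operates on audio latents with video/text conditioning); it only shows the core regression target, with NumPy standing in for a deep-learning framework and `model` as a hypothetical velocity predictor:

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """One training step of a flow matching objective: regress the model's
    predicted velocity toward the straight-line velocity between a noise
    sample x0 and a data sample x1."""
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))      # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                # point on the linear path
    target_v = x1 - x0                          # constant target velocity
    pred_v = model(xt, t)                       # network's velocity prediction
    return np.mean((pred_v - target_v) ** 2)    # MSE regression loss

# toy run with a dummy "model" that always predicts zero velocity
rng = np.random.default_rng(0)
data = rng.standard_normal((16, 8))
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), data, rng)
```

At inference, a trained velocity model is integrated from noise to data with an ODE solver, which is what enables the fast sampling the paper reports.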

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-Audio Generation | AudioCaps (test) | FAD | 5.51 | 138
Video-to-Audio Generation | VGGSound (test) | -- | -- | 62
Joint audio-video generation | JavisBench 1.0 (test) | AV-IB | 0.198 | 18
Text-to-Audio | VGGSound-Omni (test) | KL Divergence | 1.63 | 10
Video-to-Audio Generation | EchoFoley 6k | Temporal Control Score | 30 | 9
Video-to-Audio | VGGSound (test) | APCC-Δ | 0.536 | 9
Video-to-Audio Generation | LongVale | FD (VGG) | 6.41 | 8
Video-to-Audio Generation | UnAV100 | FD (VGG) | 3.86 | 8
Video-to-Audio Generation | Kling-Eval (test) | FDPaSST | 205.8 | 7
Video-to-Audio Generation | VGGSound | FD_VGG | 0.97 | 6

Showing 10 of 18 rows.
