Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
About
The Video-to-Audio (V2A) model has recently gained attention for its practical application in generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods have limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method with a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features on the spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via a cross-attention module. We further significantly improve sample quality with 'double guidance'. Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset. Furthermore, we demonstrate Diff-Foley's practical applicability and generalization capabilities via downstream finetuning. Project page: https://diff-foley.github.io/
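The abstract names contrastive audio-visual pretraining but does not state the training objective. As a rough illustration of what contrastive alignment between paired video and audio features can look like, here is a minimal NumPy sketch of a generic symmetric InfoNCE loss. It is an assumption for illustration, not the CAVP objective from the paper; the function name, the `temperature` default, and the feature shapes are all hypothetical.

```python
import numpy as np


def symmetric_contrastive_loss(video_feats, audio_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired features.

    Hypothetical helper: row i of video_feats is assumed to be paired
    with row i of audio_feats. Both arrays have shape (batch, dim).
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = v @ a.T / temperature

    def cross_entropy_diagonal(scores):
        # Numerically stable row-wise log-softmax, then take the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))
```

With this kind of objective, matched video-audio pairs are pulled together and mismatched pairs within the batch are pushed apart, so correctly paired features should yield a lower loss than the same features with the pairing shuffled.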
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | FAD | 5.62 | 95 |
| Audio-to-Video Retrieval | VGGSound (test) | Recall@1 | 11.1 | 13 |
| Video-to-Audio Retrieval | VGGSound (test) | Recall@1 | 0.095 | 11 |
| Video-to-Audio Generation | VGGSound original (test) | Inception Score | 62.37 | 8 |
| Foley generation | VGGSound (test) | FID | 15.15 | 8 |
| Video-to-Audio Generation | VGGSound sparse (test) | Alignment | 2.15 | 8 |
| Video-to-Audio Generation | MUSIC (test) | Overall Score | 1.49 | 8 |
| Spatial Audio Generation | Mixed panoramic video-FOA dataset (YT360) (test) | wCS | 27 | 6 |
| Video-to-spatial audio generation | Hybrid (test) | MOS (Subjective Quality) | 3.68 | 6 |
| Spatial Audio Generation | YT360 (test) | FD | 314.6 | 5 |