Taming Visually Guided Sound Generation
About
Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos; on a single GPU, generation takes less time than playing the resulting sound. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and are designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
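To make the idea of a spectrogram codebook more concrete, here is a minimal sketch of the nearest-neighbor lookup that vector-quantized models in the VQGAN family generally rely on. It is not the authors' implementation: the function names `quantize` and `decode`, the toy two-dimensional codebook, and the latent vectors are all hypothetical and chosen only for illustration. In the described model, a transformer would predict the codebook indices from video features, rather than deriving them from encoded latents as done here.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def decode(indices, codebook):
    """Replace each index with its codebook vector (the quantized representation)."""
    return [codebook[i] for i in indices]


# Hypothetical toy codebook with three 2-D entries (real codebooks are far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Hypothetical encoder outputs for three spectrogram patches.
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.8]]

indices = quantize(latents, codebook)
reconstructed = decode(indices, codebook)
print(indices)
print(reconstructed)
```

Because a spectrogram is represented as a short sequence of discrete indices, generating one reduces to predicting a sequence of tokens, which is the kind of task an autoregressive transformer handles well.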
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | FAD | 5.27 | 62 |
| Conditional Foley Generation | Greatest Hits perceptual study evaluation set (test) | Material Chosen Rate | 16.3 | 9 |
| Action Classification | Greatest Hits (test) | Match Accuracy | 70.6 | 8 |
| Video-to-Audio Generation | VGGSound original (test) | Inception Score | 30.8 | 8 |
| Material Classification | Greatest Hits (test) | Match Accuracy | 29.9 | 8 |
| Foley generation | VGGSound (test) | FID | 19.31 | 8 |
| Onset Prediction | Greatest Hits (test) | Onset Acc | 25.8 | 7 |
| Text-to-sound generation | AudioCaps (test) | FID | 16.87 | 5 |
| Video-to-Audio Generation | Human Evaluation V2A | Audio Quality | 2.76 | 4 |
| Video-to-Audio Generation | VisualSound (test) | KLD | 3.41 | 4 |