
Taming Visually Guided Sound Generation

About

Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
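The codebook step mentioned in the abstract can be illustrated with a rough sketch of nearest-neighbour vector quantization in general. This is not the authors' SpecVQGAN code: the function names (`quantize`, `lookup`), the 2-D toy codebook, and the feature values are all hypothetical, and the real model works on learned spectrogram features with a trained encoder, transformer, and decoder.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def lookup(indices: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Replace each index with its codebook vector (what a decoder would consume)."""
    return [codebook[i] for i in indices]


# Toy codebook with three 2-D codes (hypothetical values, for illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
features = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]

indices = quantize(features, codebook)
print(indices)  # [0, 1, 2]
```

In a setup like the one the abstract describes, a sequence model would generate such indices conditioned on video features, and `lookup` would turn them back into vectors for a decoder to render as a spectrogram.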

Vladimir Iashin, Esa Rahtu • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video-to-Audio Generation | VGGSound (test) | FAD | 5.27 | 62 |
| Conditional Foley Generation | Greatest Hits perceptual study evaluation set (test) | Material Chosen Rate | 16.3 | 9 |
| Action Classification | Greatest Hits (test) | Match Accuracy | 70.6 | 8 |
| Video-to-Audio Generation | VGGSound original (test) | Inception Score | 30.8 | 8 |
| Material Classification | Greatest Hits (test) | Match Accuracy | 29.9 | 8 |
| Foley generation | VGGSound (test) | FID | 19.31 | 8 |
| Onset Prediction | Greatest Hits (test) | Onset Acc | 25.8 | 7 |
| Text-to-sound generation | AudioCaps (test) | FID | 16.87 | 5 |
| Video-to-Audio Generation | Human Evaluation V2A | Audio Quality | 2.76 | 4 |
| Video-to-Audio Generation | VisualSound (test) | KLD | 3.41 | 4 |
Showing 10 of 12 rows
