Taming Visually Guided Sound Generation

About

Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
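The abstract describes sampling a spectrogram from a pre-trained codebook. As a rough illustration of what a codebook lookup involves in VQ-style models in general, the sketch below maps each feature vector to the index of its nearest codebook entry and then decodes the indices back to entries. It is a toy example under stated assumptions, not the authors' implementation: the names `quantize`, `decode`, `toy_codebook`, and `toy_features`, and the 2-dimensional values, are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each vector, the index of its nearest codebook entry."""
    indices = []
    for v in vectors:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(v, codebook[k]))
        indices.append(nearest)
    return indices


def decode(indices: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each index with a copy of the codebook entry it points to."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy data: three 2-D codes and three nearby feature vectors.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
toy_features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

toy_indices = quantize(toy_features, toy_codebook)
toy_decoded = decode(toy_indices, toy_codebook)
```

In a setup like the one the abstract outlines, a sequence model would predict such indices conditioned on video features instead of computing them from an existing spectrogram; that conditioning step is not shown here.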

Vladimir Iashin, Esa Rahtu • 2021

Related benchmarks

Task	Dataset	Result
Video-to-Audio Generation	VGGSound (test)	FAD 5.27	95
Video-to-Audio Generation	VGGSound	FD_VGG 4.8	22
Conditional Foley Generation	Greatest Hits perceptual study evaluation set (test)	Material Chosen Rate 16.3	9
Action Classification	Greatest Hits (test)	Match Accuracy 70.6	8
Video-to-Audio Generation	VGGSound original (test)	Inception Score 30.8	8
Material Classification	Greatest Hits (test)	Match Accuracy 29.9	8
Foley generation	VGGSound (test)	FID 19.31	8
Onset Prediction	Greatest Hits (test)	Onset Acc 25.8	7
Text-to-sound generation	AudioCaps (test)	FID 16.87	5
Video-to-Audio Generation	Human Evaluation V2A	Audio Quality 2.76	4

Showing 10 of 13 rows
