Read, Watch and Scream! Sound Generation from Text and Video

About

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within a scene. Conversely, text-to-audio generation methods produce high-quality audio but struggle to depict a scene comprehensively and to provide time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient than training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, the system becomes more flexible, allowing users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method is superior in quality, controllability, and training efficiency. Code and demo are available at https://naver-ai.github.io/rewas.
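To make the notion of sound "energy" as a time-varying structural signal concrete, here is a minimal sketch that computes a frame-wise energy curve from a waveform. It assumes root-mean-square (RMS) energy over fixed-length frames, which may differ from the definition used in the paper, and it works on audio rather than estimating energy from video. The function name `frame_energy` and the parameters `frame_len` and `hop` are hypothetical, chosen for illustration only.

```python
import numpy as np

def frame_energy(wave: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Return the RMS energy of each frame of a mono waveform (illustrative only)."""
    if wave.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    if len(wave) < frame_len:
        # Zero-pad short inputs so that at least one full frame exists.
        wave = np.pad(wave, (0, frame_len - len(wave)))
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Build an (n_frames, frame_len) index matrix for the overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx]
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Toy signal: one second of silence followed by one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
wave = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
energy = frame_energy(wave)
```

In this toy signal the curve stays at zero during the silence and rises once the tone starts, which is the kind of temporal structure that a control signal of this type would carry while the text prompt supplies the content.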

Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee · 2024

Related benchmarks

Task	Dataset	Result
Video-to-Audio Generation	VGGSound (test)	FAD 1.79	95
Audio Assessment Correlation	PAM	LCC 0.5283	45
Musicality Evaluation	MusicEval (test)	SRCC 0.6764	26
Joint audio-video generation	JavisBench 1.0 (test)	AV-IB 0.11	18
Musicality Evaluation	CMI-Pref	Accuracy 0.738	15
Musicality Evaluation	Music Arena	Accuracy 0.6776	15
Joint audio-video generation	JavisBench	Audio-Video Consistency (AV-IB) 11	12
Binaural Audio Generation	BinauralVGGSound (test)	KL_PaSST 1.57	6
Video-to-Binaural Audio Generation	BinauralVGGSound (test)	IS_PaSST Score 11.86	6
Video-to-Audio Generation	Human Evaluation V2A	Audio Quality 3.7	4

Showing 10 of 12 rows
