Conditional Generation of Audio from Video via Foley Analogies

About

The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example. Project site: https://xypb.github.io/CondFoleyGen/
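The pretext task described above pairs an input clip with a conditional clip taken from a different time in the same source video. A minimal sketch of that sampling step follows. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_clip_pair`, the fixed `clip_duration` parameter, and the requirement that the two clips do not overlap are all hypothetical choices made here.

```python
import random


def sample_clip_pair(video_duration, clip_duration, rng=None):
    """Pick start times (in seconds) for an input clip and a conditional clip.

    Both clips come from the same source video. Assumption (not from the
    paper): the two clips have equal length and must not overlap.
    """
    rng = rng or random.Random()
    if video_duration < 2 * clip_duration:
        raise ValueError("video too short for two non-overlapping clips")

    # Sample two starts in a range shortened by one clip length, then shift
    # the later one forward by clip_duration so the clips cannot overlap.
    slack = video_duration - 2 * clip_duration
    first, second = sorted((rng.uniform(0, slack), rng.uniform(0, slack)))
    second += clip_duration

    # Randomly decide which clip is the input and which is the condition,
    # so the conditional example may come before or after the input.
    starts = [first, second]
    rng.shuffle(starts)
    input_start, cond_start = starts
    return input_start, cond_start
```

During training under this sketch, the audio of the input clip would be hidden and used as the prediction target, while the conditional clip keeps both its video and its audio.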

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens • 2023

Related benchmarks

Task	Dataset	Result
Conditional Foley Generation	Greatest Hits perceptual study evaluation set (test)	Material Chosen Rate 54.3	9
Action Classification	Greatest Hits (test)	Match Accuracy 78.2	8
Material Classification	Greatest Hits (test)	Match Accuracy 43.4	8
Onset Prediction	Greatest Hits (test)	Onset Acc 26.5	7
Onset detection	Greatest Hits	Count Match 30	6
Audio-controlled video-to-audio generation	Greatest Hits	OnsetSync AP 60	6
Material Classification	Greatest Hits	Accuracy 40.6	5
Video-to-Audio Generation	Greatest Hits	Accuracy 23.94	2
Timbre Transfer	Greatest Hits (test)	Onset Accuracy 39.06	2

