Temporally Aligned Audio for Video with Autoregression

About

We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site

Ilpo Viertola, Vladimir Iashin, Esa Rahtu• 2024

Related benchmarks

Task	Dataset	Result
Video-to-Audio Generation	VGGSound (test)	FAD1.92	95
Video-to-Audio Generation	VGGSound	FD_VGG2.2	22
Video-to-Audio	VGGSound (test)	FD (PaSST)218.5	20
Video-to-Audio Generation	LongVale	FD (VGG)6.46	8
Video-to-Audio Generation	UnAV100	FD (VGG)4.57	8
Video-to-Audio Generation	Kling-Eval (test)	FDPaSST474.6	7
Video-to-Audio	VGGSound-Omni (test)	KL Divergence2.28	5
Video-to-Audio Generation	VisualSound (test)	KLD1.76	4
Video-to-Audio Generation	VAS (test)	KLD1.98	3

Showing 9 of 9 rows

Other info

Code

Follow for update

@wizwand_team Discord