
Temporally Aligned Audio for Video with Autoregression

About

We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site
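The paper does not spell out the fusion mechanism in this summary, but the core idea is that high-framerate visual features must be brought to the audio token rate before they can be fused. Below is a minimal conceptual sketch, assuming hypothetical feature dimensions, nearest-neighbor temporal resampling, and concatenation followed by a linear projection; the actual V-AURA architecture may use a different alignment and fusion scheme.

```python
import numpy as np

def align_and_fuse(visual, audio, rng=None):
    """Resample visual features to the audio token rate (nearest neighbor)
    and fuse by concatenation followed by a linear projection.

    visual: (T_v, D_v) high-framerate visual features
    audio:  (T_a, D_a) audio token embeddings
    Returns: (T_a, D_a) fused features at the audio token rate.
    """
    rng = rng or np.random.default_rng(0)
    t_v, d_v = visual.shape
    t_a, d_a = audio.shape
    # Map each audio time step to the temporally nearest visual frame.
    idx = np.clip(np.round(np.linspace(0, t_v - 1, t_a)).astype(int), 0, t_v - 1)
    aligned = visual[idx]                              # (T_a, D_v)
    fused = np.concatenate([aligned, audio], axis=1)   # (T_a, D_v + D_a)
    # Stand-in for a learned projection; here a fixed random matrix.
    w = rng.standard_normal((d_v + d_a, d_a)) / np.sqrt(d_v + d_a)
    return fused @ w

# Example: 2 s of video at 25 fps fused with 2 s of audio tokens at 40 Hz.
visual = np.random.default_rng(1).standard_normal((50, 512))
audio = np.random.default_rng(2).standard_normal((80, 256))
out = align_and_fuse(visual, audio)
print(out.shape)  # (80, 256)
```

The nearest-neighbor resampling preserves the timing of short visual events, which is what an autoregressive decoder needs to keep generated audio temporally aligned with the video.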

Ilpo Viertola, Vladimir Iashin, Esa Rahtu • 2024

Related benchmarks

Task                         Dataset                Metric          Result   Rank
Video-to-Audio Generation    VGGSound (test)        FAD             1.92     62
Video-to-Audio               VGGSound (test)        APCC-Δ          0.654    9
Video-to-Audio Generation    LongVale               FD (VGG)        6.46     8
Video-to-Audio Generation    UnAV100                FD (VGG)        4.57     8
Video-to-Audio Generation    Kling-Eval (test)      FD (PaSST)      474.6    7
Video-to-Audio Generation    VGGSound               FD (VGG)        2.88     6
Video-to-Audio               VGGSound-Omni (test)   KL Divergence   2.28     5
Video-to-Audio Generation    VisualSound (test)     KLD             1.76     4
Video-to-Audio Generation    VAS (test)             KLD             1.98     3
