UAVM: Towards Unifying Audio and Visual Models

About

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc: 65.8 | 18 |
| Emotion Classification | CREMA-D | F1 (Macro): 73.7 | 18 |
| Audio-Visual Event Classification | AudioSet Full | Fusion mAP: 48 | 7 |
| Emotional Attribute Prediction | MSP-IMPROV Visual | Arousal: 0.274 | 6 |
| Emotional Attribute Prediction | MSP-IMPROV Audio-Visual | Arousal: 0.471 | 6 |
| Emotional Attribute Prediction | MSP-IMPROV Acoustic | Arousal: 0.578 | 6 |