UAVM: Towards Unifying Audio and Visual Models
About
Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. We also find several intriguing properties of UAVM that its modality-independent counterparts do not have.
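To illustrate the core idea of unifying the two branches, here is a minimal numpy sketch: each modality keeps its own input projection, but the embeddings then flow through a single shared classifier whose weights are identical for audio and video. All dimensions, weight shapes, and function names below are hypothetical illustrations, not details from the paper, and the shared "backbone" is reduced to one linear layer for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
AUDIO_DIM, VIDEO_DIM, SHARED_DIM, NUM_CLASSES = 128, 512, 64, 10

# Modality-specific projections map each input into a shared embedding space.
W_audio = rng.standard_normal((AUDIO_DIM, SHARED_DIM)) * 0.01
W_video = rng.standard_normal((VIDEO_DIM, SHARED_DIM)) * 0.01

# A single shared head serves both modalities -- the "unified" part:
# its weights do not depend on which modality produced the embedding.
W_shared = rng.standard_normal((SHARED_DIM, NUM_CLASSES)) * 0.01

def classify(features, W_proj):
    """Project modality-specific features into the shared space,
    then apply the shared classifier head."""
    z = np.tanh(features @ W_proj)   # shared embedding space
    return z @ W_shared              # shared classifier

audio_logits = classify(rng.standard_normal(AUDIO_DIM), W_audio)
video_logits = classify(rng.standard_normal(VIDEO_DIM), W_video)
print(audio_logits.shape, video_logits.shape)  # both (10,)
```

In a conventional two-branch model, each modality would instead have its own classifier (two separate `W_shared` matrices); sharing the later layers is what allows the unified model's audio and visual representations to be compared directly.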
Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass • 2022
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc | 65.8 | 18 |
| Emotion Classification | CREMA-D | F1 (Macro) | 73.7 | 18 |
| Audio-Visual Event Classification | AudioSet Full | Fusion mAP | 48 | 7 |
| Emotional Attribute Prediction | MSP-IMPROV Visual | Arousal | 0.274 | 6 |
| Emotional Attribute Prediction | MSP-IMPROV Audio-Visual | Arousal | 0.471 | 6 |
| Emotional Attribute Prediction | MSP-IMPROV Acoustic | Arousal | 0.578 | 6 |