UAVM: Towards Unifying Audio and Visual Models
About
Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. We also find several intriguing properties of UAVM that its modality-independent counterparts do not have.
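To illustrate the core idea of unifying the two branches, here is a minimal numpy sketch: each modality keeps its own input projection, but the embeddings then flow through a single shared classifier whose weights are identical for audio and video. All dimensions, weight shapes, and function names below are hypothetical illustrations, not details from the paper, and the shared "backbone" is reduced to one linear layer for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
AUDIO_DIM, VIDEO_DIM, SHARED_DIM, NUM_CLASSES = 128, 512, 64, 10

# Modality-specific projections map each input into a shared embedding space.
W_audio = rng.standard_normal((AUDIO_DIM, SHARED_DIM)) * 0.01
W_video = rng.standard_normal((VIDEO_DIM, SHARED_DIM)) * 0.01

# A single shared head serves both modalities -- the "unified" part:
# its weights do not depend on which modality produced the embedding.
W_shared = rng.standard_normal((SHARED_DIM, NUM_CLASSES)) * 0.01

def classify(features, W_proj):
    """Project modality-specific features into the shared space,
    then apply the shared classifier head."""
    z = np.tanh(features @ W_proj)   # shared embedding space
    return z @ W_shared              # shared classifier

audio_logits = classify(rng.standard_normal(AUDIO_DIM), W_audio)
video_logits = classify(rng.standard_normal(VIDEO_DIM), W_video)
print(audio_logits.shape, video_logits.shape)  # both (10,)
```

In a conventional two-branch model, each modality would instead have its own classifier (two separate `W_shared` matrices); sharing the later layers is what allows the unified model's audio and visual representations to be compared directly.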
Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass • 2022
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc | 65.8 | 18 |
| Emotion Classification | CREMA-D | F1 (Macro) | 73.7 | 18 |
| Audio-Visual Event Classification | AudioSet Full | Fusion mAP | 48 | 7 |
| Emotional Attribute Prediction | MSP-IMPROV Visual | Arousal | 0.274 | 6 |
| Emotional Attribute Prediction | MSP-IMPROV Audio-Visual | Arousal | 0.471 | 6 |
| Emotional Attribute Prediction | MSP-IMPROV Acoustic | Arousal | 0.578 | 6 |