Attention Bottlenecks for Multimodal Fusion

About

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and share only what is necessary. We find that such a strategy improves fusion performance while also reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
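The bottleneck idea described above can be illustrated with a minimal NumPy sketch of one fusion layer. This is one plausible reading of the abstract, not the released implementation: the function names (`self_attention`, `bottleneck_fusion_layer`, `make_params`), the single-head attention without residuals or MLPs, the random weights, and the choice to average the two per-modality bottleneck updates are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Plain single-head scaled dot-product self-attention over one token set.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def bottleneck_fusion_layer(video, audio, bottleneck, video_params, audio_params):
    # Each modality attends only over its own tokens plus the shared
    # bottleneck tokens; video and audio tokens never attend to each other.
    n_v, n_a = len(video), len(audio)
    v_out = self_attention(np.concatenate([video, bottleneck]), *video_params)
    a_out = self_attention(np.concatenate([audio, bottleneck]), *audio_params)
    new_video, bn_from_video = v_out[:n_v], v_out[n_v:]
    new_audio, bn_from_audio = a_out[:n_a], a_out[n_a:]
    # Assumption: merge the two bottleneck updates by simple averaging.
    new_bottleneck = 0.5 * (bn_from_video + bn_from_audio)
    return new_video, new_audio, new_bottleneck

def make_params(rng, dim):
    # Hypothetical random projection weights (query, key, value).
    return tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

rng = np.random.default_rng(0)
dim = 8
video = rng.normal(size=(6, dim))       # 6 video tokens
audio = rng.normal(size=(4, dim))       # 4 audio tokens
bottleneck = rng.normal(size=(2, dim))  # 2 bottleneck latents
video_params = make_params(rng, dim)
audio_params = make_params(rng, dim)

new_video, new_audio, new_bottleneck = bottleneck_fusion_layer(
    video, audio, bottleneck, video_params, audio_params
)
```

In this sketch, audio can influence video only through the bottleneck tokens, and only from the next layer onward. With N_v video tokens, N_a audio tokens and B bottleneck tokens, the two attention score matrices have (N_v + B)^2 and (N_a + B)^2 entries, compared with (N_v + N_a)^2 for full pairwise attention over the concatenated sequence, which is smaller whenever B is small relative to the sequence lengths.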

Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun • 2021

Related benchmarks

Task	Dataset	Metric	Result
Action Recognition	Kinetics-400	Top-1 Acc	80.8	498
Action Recognition	UCF101 (test)	Accuracy	91.8	357
Text-to-Video Retrieval	MSR-VTT (test)	R@1	3.8	265
Audio Classification	AudioSet 20K	mAP	31.3	147
Action Recognition	EPIC-KITCHENS 100 (test)	Top-1 Verb Acc	64.8	101
Audio Classification	AudioSet 2M	mAP	44.3	98
Multimodal Multilabel Classification	MM-IMDB (test)	Macro F1	59.6	94
Audio Classification	VGG-Sound	Top-1 Accuracy	52.3	83
Audio Classification	AudioSet	mAP	41.5	60
Classification	AudioSet (test)	mAP	44.3	57

Showing 10 of 54 rows
