M&M Mix: A Multimodal Multiview Transformer Ensemble

About

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy for action classes on the test set, which is 4.1% higher than last year's winning entry.

Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid • 2022
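
The abstract mentions an ensemble of models but does not say how their predictions are combined. As a rough illustration only, the sketch below assumes one common scheme: convert each model's logits to probabilities with a softmax, then take a (optionally weighted) average. The function names, the weights, and the example logits are all hypothetical and are not taken from the report.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one model's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(
    per_model_logits: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-model softmax probabilities.

    Assumption: score averaging is used here purely as an example;
    the report's actual ensembling rule is not stated in the abstract.
    """
    if not per_model_logits:
        raise ValueError("need at least one model")
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0] * n_models  # uniform weights by default
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")

    n_classes = len(per_model_logits[0])
    weight_sum = sum(weights)
    averaged = [0.0] * n_classes
    for logits, w in zip(per_model_logits, weights):
        if len(logits) != n_classes:
            raise ValueError("all models must score the same classes")
        for c, p in enumerate(softmax(logits)):
            averaged[c] += w * p / weight_sum
    return averaged


def top1(probs: Sequence[float]) -> int:
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits from three ensemble members over three classes.
model_a = [2.0, 1.0, 0.1]
model_b = [0.5, 2.5, 0.3]
model_c = [0.2, 2.0, 0.1]

probs = ensemble_probs([model_a, model_b, model_c])
prediction = top1(probs)
```

In this toy example, `model_a` alone prefers class 0, while the averaged scores favour class 1, which is the kind of disagreement an ensemble is meant to smooth out.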

Related benchmarks

Task                Dataset                    Metric            Result  Rank
Action Recognition  EPIC-KITCHENS 100 (test)   Top-1 Verb Acc    70.9    101
Action Recognition  EPIC-KITCHENS (val)        Verb Top-1 Acc    72      36
Action Recognition  Epic-Kitchens-100 (val)    Top-1 Action Acc  56.9    10
