Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
About
Recent video recognition models use Transformers for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention, which can model global context but at a high computational cost. In comparison, convolutional designs for video offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Furthermore, both the aggregation and interaction steps are implemented with efficient convolution and element-wise multiplication operations, which are computationally cheaper than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and show that a parallel spatial and temporal encoding design is the optimal choice. Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets (Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower computational cost. Our code and models are released at https://github.com/TalalWasim/Video-FocalNets.
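For intuition, below is a minimal PyTorch sketch of the idea described above: context is first aggregated with stacked depth-wise convolutions and gated pooling, and only then interacts with the query through element-wise multiplication, reversing the order used by self-attention. The class names (`FocalModulation`, `SpatioTemporalFocalModulation`), the `focal_levels`/`kernel_size` hyperparameters, and the gating layout are illustrative assumptions based on the abstract and the original FocalNets design, not the authors' released implementation (see the repository link above for that).

```python
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    """Sketch of focal modulation over a 2D grid: aggregate context first
    (depth-wise convs + gated pooling), then interact with the query via
    element-wise multiplication. Illustrative, not the official code."""

    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        # One projection yields the query, the context features, and the
        # per-level gates (focal_levels + 1: one per level plus global).
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Hierarchical aggregation: each level's depth-wise conv widens
        # the effective receptive field (kernels 3, 5, 7, ...).
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size + 2 * l,
                          padding=(kernel_size + 2 * l) // 2, groups=dim),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, 1)   # modulator projection
        self.proj = nn.Linear(dim, dim)   # output projection
        self.focal_levels = focal_levels

    def forward(self, x):                               # x: (B, H, W, C)
        C = x.shape[-1]
        q, ctx, gates = torch.split(
            self.f(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                   # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)               # (B, L+1, H, W)
        ctx_all = 0
        for l, layer in enumerate(self.layers):         # local -> coarse
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        # Globally pooled context acts as the final focal level.
        ctx_all = ctx_all + ctx.mean((2, 3), keepdim=True) * gates[:, -1:]
        m = self.h(ctx_all).permute(0, 2, 3, 1)         # modulator (B,H,W,C)
        return self.proj(q * m)                         # interaction step


class SpatioTemporalFocalModulation(nn.Module):
    """Parallel spatio-temporal design: independent spatial and temporal
    modulation branches whose outputs are summed."""

    def __init__(self, dim):
        super().__init__()
        self.spatial = FocalModulation(dim)
        self.temporal = FocalModulation(dim)

    def forward(self, x):                               # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        # Spatial branch: modulate each frame over its H x W grid.
        xs = self.spatial(x.reshape(B * T, H, W, C)).reshape(B, T, H, W, C)
        # Temporal branch: treat time as a (T, 1) grid so the same block
        # aggregates context along the temporal axis only.
        xt = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, 1, C)
        xt = self.temporal(xt).reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
        return xs + xt


x = torch.randn(2, 8, 14, 14, 96)      # (batch, frames, height, width, dim)
print(SpatioTemporalFocalModulation(96)(x).shape)  # -> (2, 8, 14, 14, 96)
```

The wrapper illustrates the parallel encoding design the paper identifies as optimal: the spatial branch modulates each frame over its H×W grid while the temporal branch modulates each spatial location across the T frames, and the two outputs are summed.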
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 71.1 | 341 |
| Action Recognition | Diving-48 | Top-1 Accuracy | 90.8 | 82 |
| Video Action Recognition | Kinetics-400 (test) | Top-1 Accuracy | 83.6 | 44 |
| Action Recognition | ActivityNet v1.3 | -- | -- | 31 |
| Video Action Recognition | Kinetics-600 (test) | Top-1 Accuracy | 86.7 | 13 |
| Action Recognition | Diving-48 V2 (test) | Top-1 Accuracy | 90.8 | 9 |