
Relational Self-Attention: What's Missing in Attention for Video Understanding

About

Convolution has arguably been the most important feature transform for modern neural networks, driving the advances of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
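To make the contrast concrete, the sketch below implements plain self-attention over a sequence of spatio-temporal tokens and adds a simplified "relational" term that aggregates the relation (attention) map itself through a learned transform, rather than only the token values. This is a minimal NumPy illustration of the idea in the abstract, not the paper's exact RSA formulation; the function and weight names (`Wq`, `Wk`, `Wv`, `Wh`) are assumptions, and feeding relation vectors through a fixed-size `Wh` assumes a fixed token count.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_self_attention(X, Wq, Wk, Wv, Wh):
    """Illustrative sketch (not the paper's exact RSA).

    X:  (T, C) sequence of T spatio-temporal tokens.
    Wq, Wk, Wv: (C, d) projections for queries, keys, values.
    Wh: (T, d) transform applied to each token's relation vector
        (a simplification that assumes a fixed number of tokens T).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Dynamic kernel generated from query-key relations.
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    # Standard self-attention: aggregate token values with the kernel.
    basic = A @ V
    # Relational context: the relation map itself is transformed and
    # aggregated, so correspondence structure reaches the output directly.
    relational = A @ (A @ Wh)
    return basic + relational

# Usage: 4 tokens with 8 channels, projected to 8 dimensions.
rng = np.random.default_rng(0)
T, C, d = 4, 8, 8
X = rng.standard_normal((T, C))
out = relational_self_attention(
    X,
    rng.standard_normal((C, d)),
    rng.standard_normal((C, d)),
    rng.standard_normal((C, d)),
    rng.standard_normal((T, d)),
)
```

In the plain-attention term, motion information only modulates the mixing weights; the relational term is one way to let the pattern of correspondences contribute to the representation itself, which is the gap the abstract points at.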

Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 67.7 | 341 |
| Action Recognition | Something-Something v2 (test) | Top-1 Accuracy | 66 | 333 |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 56.1 | 162 |
| Action Recognition | Diving-48 | Top-1 Accuracy | 84.2 | 82 |
| Action Recognition | Diving-48 (test) | Top-1 Accuracy | 84.2 | 81 |
| Video Classification | Something-Something v2 | Top-1 Accuracy | 66 | 56 |
| Action Recognition | FineGym Gym288 | Mean Per-Class Accuracy | 0.509 | 14 |
| Action Recognition | FineGym Gym99 | Mean Per-Class Accuracy | 86.4 | 14 |
| Video Classification | Something-Something V1 | Top-1 Accuracy | 54 | 13 |
| Video Classification | Diving-48 v1 (test) | Top-1 Accuracy | 84.2 | 11 |
Showing 10 of 12 rows

Other info

Code
