
TRecViT: A Recurrent Video Transformer

About

We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs perform mixing over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes; it is the first causal video model in the state-space models family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOPs count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in real time. Compared with causal transformer-based models (TSM, RViT) and other recurrent models such as LSTM, TRecViT obtains state-of-the-art results on the challenging SSv2 dataset. Code and checkpoints are available at https://github.com/google-deepmind/trecvit.
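The time-space-channel factorisation described above can be sketched in plain NumPy. This is an illustrative toy, not the released implementation: all parameter names, shapes, and the specific gating form are assumptions, and residual connections and layer normalisation are omitted. The key property it demonstrates is causality in time: only the LRU mixes across frames, and it only looks backwards.

```python
import numpy as np

def gated_lru(x, decay, gate_w):
    # Simplified gated linear recurrence over time (causal):
    # h_t = decay * h_{t-1} + (1 - decay) * sigmoid(x_t @ gate_w) * x_t
    # (illustrative gating; the actual LRU parameterisation differs)
    T, D = x.shape
    h = np.zeros(D)
    out = np.empty_like(x)
    for t in range(T):
        g = 1.0 / (1.0 + np.exp(-(x[t] @ gate_w)))  # input gate
        h = decay * h + (1.0 - decay) * g * x[t]
        out[t] = h
    return out

def self_attention(x, wq, wk, wv):
    # Single-head attention over the spatial tokens of one frame.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(scores - scores.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ v

def mlp(x, w1, w2):
    # Channel mixing applied independently to every token.
    return np.maximum(x @ w1, 0.0) @ w2

def trecvit_block(video_tokens, params):
    # video_tokens: (T frames, N spatial tokens, D channels)
    T, N, D = video_tokens.shape
    # 1) time mixing: gated LRU over frames, per spatial token (causal)
    x = np.stack([gated_lru(video_tokens[:, n], params["decay"], params["gate_w"])
                  for n in range(N)], axis=1)
    # 2) space mixing: self-attention over tokens, per frame
    x = np.stack([self_attention(x[t], params["wq"], params["wk"], params["wv"])
                  for t in range(T)], axis=0)
    # 3) channel mixing: MLP on every token
    return mlp(x, params["w1"], params["w2"])
```

Because spatial attention and the MLP never mix across frames, editing a later frame cannot change the block's output at earlier frames, which is what makes the architecture usable for streaming inference.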

Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu • 2024

Related benchmarks

| Task                 | Dataset              | Metric    | Result | Rank |
|----------------------|----------------------|-----------|--------|------|
| Video Classification | Kinetics-400         | Top-1 Acc | 46     | 131  |
| Video Classification | Kinetics-400 (test)  | Top-1 Acc | 76.5   | 97   |
| Video Recognition    | SSv2                 | Top-1 Acc | 68.2   | 47   |
| Point Tracking       | DAVIS                | AJ        | 70.6   | 38   |
| Point Tracking       | Perception (test)    | AJ        | 78.3   | 3    |
| Video Classification | SSv2                 | Top-1 Acc | 53.9   | 2    |
