Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

About

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.

AJ Piergiovanni, Weicheng Kuo, Anelia Angelova• 2022

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1K	Top-1 Acc81.4	1239
Action Recognition	Something-Something v2 (val)	Top-1 Accuracy76.1	565
Action Recognition	Kinetics-400	Top-1 Acc90.9	505
Action Recognition	Something-Something v2	Top-1 Accuracy76.1	363
Video Action Recognition	Kinetics 400 (val)	Top-1 Acc90.9	166
Action Recognition	Kinetics-600	Top-1 Acc91.8	97
Action Recognition	Kinetics 700	Top-1 Accuracy83.8	74
Action Recognition	Charades (test)	mAP0.662	53

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord