Scaling 4D Representations

About

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused its evaluations on semantic tasks (action classification, ImageNet classification, etc.). In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M parameters all the way to 22B parameters, by far the largest self-supervised video model reported. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations. Pretrained models are available at https://github.com/google-deepmind/representations4d
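For readers unfamiliar with masked auto-encoding on video, the following is a minimal NumPy sketch of the general idea only: split a clip into space-time patches, hide most of them, and score a reconstruction on the hidden patches alone. It is not the authors' implementation. The helper names (`patchify_video`, `random_mask`, `masked_mse`), the patch size, the clip shape, and the 0.9 mask ratio are all illustrative assumptions, and the "prediction" is a trivial stand-in where a real system would use a transformer encoder and decoder.

```python
import numpy as np


def patchify_video(video, pt, ph, pw):
    """Split a (T, H, W, C) clip into flattened space-time patches (tokens)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)


def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted (visible, masked) token indices for a random mask."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])


def masked_mse(pred, target):
    """Mean squared reconstruction error, computed on masked tokens only."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
video = rng.random((16, 64, 64, 3)).astype(np.float32)  # hypothetical clip
tokens = patchify_video(video, pt=2, ph=16, pw=16)      # assumed patch size

# Assumed mask ratio; only the visible tokens would be fed to the encoder.
visible_idx, masked_idx = random_mask(tokens.shape[0], 0.9, rng)
encoder_input = tokens[visible_idx]

# Stand-in for encoder + decoder: predict the mean visible token everywhere.
pred = np.repeat(encoder_input.mean(axis=0, keepdims=True), len(masked_idx), axis=0)
loss = masked_mse(pred, tokens[masked_idx])
```

Because the loss is computed only on hidden patches, the model cannot succeed by copying its input, which is the property that lets this objective train on unlabeled video.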

João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman • 2024

Related benchmarks

Task	Dataset	Result
Action Recognition	SSV2	Top-1 Acc 60.3	142
Video Object Segmentation	DAVIS	--	134
Monocular Depth Estimation	ScanNet	AbsRel 1.21	111
Action Recognition	Kinetics	Top-1 Acc 46.4	83
Video Classification	Something-Something v2 (val)	Top-1 Acc 68.2	77
Semantic segmentation	Waymo	mIoU 0.763	14
Point Tracking	Perception (test)	AJ Score 70.4	14
Video Semantic Segmentation	Waymo	mIoU 76.3	13
Video Classification	SS v2	Accuracy (%) 60.3	13
Human Pose Estimation	JHMDB	PCK@0.1 44.4	12

Showing 10 of 22 rows
