Recurrent Video Masked Autoencoders

About

We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.

Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman• 2025

Related benchmarks

Task	Dataset	Result
Action Recognition	SSV2	Top-1 Acc61.4	142
Video Object Segmentation	DAVIS	--	128
Monocular Depth Estimation	ScanNet	AbsRel0.97	103
Action Recognition	Kinetics	Top-1 Acc53.1	83
Action Recognition	Something-Something V2 (Full)	Top-1 Accuracy40.08	18
Action Recognition	Ego-Exo4D Bike Repair	Top-1 Accuracy22.76	16
Semantic segmentation	Waymo	mIoU0.711	14
Point Tracking	Perception (test)	AJ Score68.1	14
Video Classification	SS v2	Accuracy (%)66.7	13
Video Semantic Segmentation	Waymo	mIoU73.2	13

Showing 10 of 25 rows

Other info

Follow for update

@wizwand_team Discord