
Recurrent Video Masked Autoencoders

About

We present Recurrent Video Masked Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves performance competitive with state-of-the-art video models (e.g., VideoMAE, V-JEPA) on video-level tasks such as action classification and point and object tracking, while matching or exceeding the performance of image models (e.g., DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.
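The linear-cost claim can be made concrete with a rough counting sketch. This is not taken from the paper: it assumes (hypothetically) that a recurrent model lets each frame's tokens attend only to that frame and to a fixed-size carried state, while a joint spatio-temporal attention model scores every token against every other token in the clip. The function names and the token counts used below are illustrative only.

```python
def joint_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Token pairs scored when all frames attend to each other at once.

    Assumption: a single full attention over the whole clip.
    """
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def recurrent_pairs(num_frames: int, tokens_per_frame: int, state_tokens: int) -> int:
    """Token pairs scored when each frame attends to itself plus a fixed-size state.

    Assumption: the carried state has a constant number of tokens, so the
    per-frame cost does not depend on how many frames came before.
    """
    per_step = tokens_per_frame * (tokens_per_frame + state_tokens)
    return num_frames * per_step


if __name__ == "__main__":
    # 196 tokens per frame is an illustrative choice (a 14 x 14 patch grid).
    for frames in (16, 32, 64):
        print(frames,
              joint_attention_pairs(frames, 196),
              recurrent_pairs(frames, 196, 196))
```

Under these assumptions, doubling the clip length doubles the recurrent count but quadruples the joint-attention count, which is the sense in which recurrence scales linearly with the temporal horizon.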

Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman • 2025

Related benchmarks

Task                         Dataset                        Result                Rank
Action Recognition           SSV2                           Top-1 Acc 61.4        142
Video Object Segmentation    DAVIS                          --                    128
Monocular Depth Estimation   ScanNet                        AbsRel 0.97           103
Action Recognition           Kinetics                       Top-1 Acc 53.1        83
Action Recognition           Something-Something V2 (Full)  Top-1 Accuracy 40.08  18
Action Recognition           Ego-Exo4D Bike Repair          Top-1 Accuracy 22.76  16
Semantic segmentation        Waymo                          mIoU 0.711            14
Point Tracking               Perception (test)              AJ Score 68.1         14
Video Classification         SS v2                          Accuracy (%) 66.7     13
Video Semantic Segmentation  Waymo                          mIoU 73.2             13
Showing 10 of 25 rows
