
Siamese Masked Autoencoders

About

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction ($95\%$) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.
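The core idea of the asymmetric masking scheme can be sketched in a few lines: the past frame stays fully visible, while only a small fraction of patch indices in the future frame are kept. The helper below is a minimal illustration under assumed names (`asymmetric_mask`, a 14x14 = 196 patch grid), not the authors' implementation.

```python
import numpy as np

def asymmetric_mask(num_patches: int, mask_ratio: float = 0.95, seed=None):
    """Sample visible-patch indices for a (past, future) frame pair.

    SiamMAE-style asymmetric masking: the past frame is left unchanged,
    while a large fraction (default 95%) of the future frame's patches
    is masked out. Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    past_visible = np.arange(num_patches)            # past frame: all patches visible
    n_keep = int(num_patches * (1 - mask_ratio))     # e.g. 196 patches -> 9 kept
    future_visible = rng.permutation(num_patches)[:n_keep]
    return past_visible, future_visible

# For a 224x224 image with 16x16 patches (14x14 = 196 tokens):
past, future = asymmetric_mask(196, mask_ratio=0.95)
```

With 95% masking, only about 9 of 196 future-frame patches remain visible, which is what forces the decoder's cross-attention layers to rely on the past frame to reconstruct the missing content.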

Agrim Gupta, Jiajun Wu, Jia Deng, Li Fei-Fei • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 56 | 1130 |
| Action Recognition | SSV2 | Top-1 Acc | 56 | 93 |
| Monocular Depth Estimation | ScanNet | AbsRel | 1 | 64 |
| Video Object Segmentation | DAVIS | J Mean | 62.8 | 58 |
| Video Instance Parsing | VIP (val) | mIoU | 33.2 | 20 |
| Human Pose Estimation | JHMDB (val) | PCK@0.1 | 46.1 | 19 |
| wet-AMD conversion prediction | HARBOR 6-month window (test) | AUROC | 0.666 | 19 |
| wet-AMD conversion prediction | HARBOR 12-month window (test) | AUROC | 0.638 | 19 |
| Human Pose Estimation | JHMDB | PCK@0.1 | 47.2 | 12 |
| AD conversion prediction | ADNI 3-years window | AUROC | 78.7 | 8 |

Showing 10 of 15 rows
