Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities

About

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
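One way to picture a temporal feature similarity loss is sketched below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the targets are cosine similarities between patch features of two frames, normalized with a temperature softmax, and that predictions are scored with cross-entropy. The function names, the temperature value, and the random arrays standing in for pre-trained self-supervised features are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def similarity_targets(feats_t, feats_tk, temperature=0.1):
    """Turn patch features of two frames into row-wise soft targets.

    feats_t:  (N, D) patch features of frame t
    feats_tk: (N, D) patch features of a later frame t+k
    Returns an (N, N) matrix; row i is a distribution over patches of frame t+k.
    The softmax normalization and temperature are assumptions of this sketch.
    """
    affinity = l2_normalize(feats_t) @ l2_normalize(feats_tk).T  # cosine similarities
    logits = affinity / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def similarity_loss(pred_logits, targets):
    """Cross-entropy between predicted logits (N, N) and soft targets (N, N)."""
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())


# Random features stand in for pre-trained patch features (assumption).
rng = np.random.default_rng(0)
feats_t = rng.normal(size=(16, 32))
# The later frame is a slightly perturbed copy, so patch i matches patch i.
feats_tk = feats_t + 0.05 * rng.normal(size=(16, 32))

targets = similarity_targets(feats_t, feats_tk)
uniform_loss = similarity_loss(np.zeros((16, 16)), targets)   # uninformed prediction
matched_loss = similarity_loss(np.log(targets), targets)      # prediction equals target
```

In this toy setup the targets peak where a patch in frame t resembles a patch in the later frame, so a model trained to predict them has to capture both what a patch looks like and where it moves. A prediction that matches the targets scores no worse than a uniform guess.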

Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video segmentation | DAVIS | -- | 14 |
| Unsupervised Video Object Segmentation | DAVIS U17 (val) | J&F Mean Score 29 | 11 |
| Object dynamics prediction | YouTube VIS 2021 (test) | FG-ARI 28.9 | 9 |
| Unsupervised object-centric learning | Abdominal surgical dataset (test) | mBO-V 46.3 | 8 |
| Unsupervised object-centric learning | Cholecystectomy surgical dataset (test) | mBO-V 30.1 | 8 |
| Unsupervised object-centric learning | Thoracic surgical dataset (test) | mBO-V 21.9 | 8 |
| Object Discovery | MOVi-E v1 (test) | FG-ARI 73.9 | 7 |
| Unsupervised image segmentation | MOVi-E individual frames | -- | 7 |
| Object Discovery | MOVi-C v1 (test) | FG-ARI 64.8 | 6 |
| Unsupervised image segmentation | MOVi-C individual frames | -- | 6 |
Showing 10 of 11 rows
