Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities

About

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
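To make the idea of a temporal feature similarity loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the targets are softmax-normalized cosine similarities between the pre-trained patch features of frame t and those of a later frame t+k, and that the model's per-patch predictions are trained against these targets with a cross-entropy loss. The function names, the temperature value, and the toy features are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    scaled = [x / temperature for x in row]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_targets(feats_t, feats_tk, temperature=0.1):
    """Row i is a distribution over the patches of frame t+k, built from the
    cosine similarity between patch i of frame t and every patch of frame t+k.
    The temperature is an assumed hyperparameter."""
    a = [l2_normalize(f) for f in feats_t]
    b = [l2_normalize(f) for f in feats_tk]
    return [
        softmax([sum(x * y for x, y in zip(fa, fb)) for fb in b], temperature)
        for fa in a
    ]


def similarity_loss(pred_logits, targets):
    """Mean cross-entropy between the target distributions and softmax(pred_logits)."""
    total = 0.0
    for logits, target in zip(pred_logits, targets):
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total -= sum(p * (l - log_z) for p, l in zip(target, logits))
    return total / len(targets)


# Toy example: three patches with 2-D features; each patch's content
# moves to a different location in the later frame.
feats_t = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feats_tk = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]
targets = similarity_targets(feats_t, feats_tk)
```

In this toy setup, each row of `targets` puts most of its mass on the location the patch content moved to, which is one way to read the abstract's claim that the loss introduces a motion bias: to predict the targets well, a model has to capture where things go, not only what they look like.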

Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius • 2023

Related benchmarks

Task	Dataset	Result
Video segmentation	DAVIS	--	41
Object Discovery	MOVi-C	mBOi 16.1	22
Unsupervised Video Object Segmentation	DAVIS U17 (val)	J&F Mean Score 29	11
object dynamics prediction	YouTube VIS 2021 (test)	FG-ARI 28.9	9
Unsupervised object-centric learning	Abdominal surgical dataset (test)	mBO-V 46.3	8
Unsupervised object-centric learning	Cholecystectomy surgical dataset (test)	mBO-V 30.1	8
Object Discovery	YTVIS-HQ	ARI 33.8	8
Object Discovery	YTVIS 2022	ARI 33.4	8
Unsupervised object-centric learning	Thoracic surgical dataset (test)	mBO-V 21.9	8
Video Object Discovery	MOVi-C synthetic (test)	ARI 41.9	8

Showing 10 of 40 rows
