Bridging the Gap to Real-World Object-Centric Learning

About

Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly out-performs existing image-based object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature.
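The training signal described above can be illustrated with a small sketch. The abstract does not specify the loss or the decoder, so the following rests on assumptions: each slot decodes its own per-patch feature prediction plus a mask logit, the masks are softmax-normalized across slots, and the combined prediction is compared to the features of a frozen self-supervised encoder with mean squared error. The names `combine_slots` and `feature_reconstruction_loss` are hypothetical, and the target features are toy values, not real encoder outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def combine_slots(slot_features, slot_mask_logits):
    """Blend per-slot predictions into one feature map.

    slot_features[k][n][d]: feature d predicted by slot k for patch n.
    slot_mask_logits[k][n]: unnormalized mask score of slot k for patch n.
    Returns (recon[n][d], masks[k][n]); masks sum to 1 over slots per patch.
    """
    num_slots = len(slot_features)
    num_patches = len(slot_features[0])
    dim = len(slot_features[0][0])
    masks = [[0.0] * num_patches for _ in range(num_slots)]
    recon = [[0.0] * dim for _ in range(num_patches)]
    for n in range(num_patches):
        # Assumption: slots compete for each patch via a softmax over slots.
        weights = softmax([slot_mask_logits[k][n] for k in range(num_slots)])
        for k in range(num_slots):
            masks[k][n] = weights[k]
            for d in range(dim):
                recon[n][d] += weights[k] * slot_features[k][n][d]
    return recon, masks


def feature_reconstruction_loss(recon, target):
    """Mean squared error between reconstructed and target patch features."""
    total, count = 0.0, 0
    for recon_patch, target_patch in zip(recon, target):
        for r, t in zip(recon_patch, target_patch):
            total += (r - t) ** 2
            count += 1
    return total / count


# Toy stand-in for frozen encoder features: 2 patches, 2 feature dimensions.
target = [[1.0, 0.0], [0.0, 1.0]]
# Two slots, each predicting one constant feature vector for every patch.
slot_features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
# Slot 0 claims patch 0, slot 1 claims patch 1.
sharp_logits = [[10.0, -10.0], [-10.0, 10.0]]
# No slot has a preference, so every patch is an even blend.
flat_logits = [[0.0, 0.0], [0.0, 0.0]]

sharp_recon, sharp_masks = combine_slots(slot_features, sharp_logits)
flat_recon, flat_masks = combine_slots(slot_features, flat_logits)
sharp_loss = feature_reconstruction_loss(sharp_recon, target)
flat_loss = feature_reconstruction_loss(flat_recon, target)
```

In this toy setup the loss is lower when each slot takes ownership of the patch it explains best, which is the intuition behind using the masks as an object segmentation. In a real model the slot predictions and mask logits would come from learned networks rather than fixed lists.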

Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, Francesco Locatello • 2022

Related benchmarks

Task                                Dataset                  Result                     Rank
Semantic segmentation               PASCAL VOC 2012 (test)   mIoU 37.2                  1342
Object Hallucination Evaluation     POPE                     Accuracy 81.84             935
Visual Question Answering           GQA                      Accuracy 56.32             374
Multimodal Evaluation               MM-Vet                   Accuracy 18.9              122
Visual Question Answering           VQA v2 (val)             Accuracy 58.32             99
Semantic segmentation               COCO Stuff-27 (val)      mIoU 24                    75
Counterfactual reasoning            CVQA                     Accuracy 69.29             40
Multi-modal Perception Evaluation   MME Perception           Perception Score 1.22e+3   31
Unsupervised Object Segmentation    COCO                     mBOi 31.6                  26
OOD Generalization                  OODCV                    Accuracy 56.66             20
Showing 10 of 61 rows
