Zero-Shot Object-Centric Representation Learning

About

The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

Aniket Didolkar, Andrii Zadaianchuk, Anirudh Goyal, Mike Mozer, Yoshua Bengio, Georg Martius, Maximilian Seitzer • 2024

Related benchmarks

Task                                       | Dataset                      | Metric           | Result  | Rank
-------------------------------------------|------------------------------|------------------|---------|-----
Object Hallucination Evaluation            | POPE                         | Accuracy         | 81.54   | 935
Visual Question Answering                  | GQA                          | Accuracy         | 56.15   | 374
Multimodal Evaluation                      | MM-Vet                       | Accuracy         | 17.2    | 122
Counterfactual reasoning                   | CVQA                         | Accuracy         | 68.85   | 40
Multi-modal Perception Evaluation          | MME Perception               | Perception Score | 1.24e+3 | 31
Vision-Language Compositionality           | SugarCrepe                   | Accuracy         | 81.24   | 20
OOD Generalization                         | OODCV                        | Accuracy         | 55.18   | 20
Robustness to Natural Adversarial Examples | NaturalBench                 | Accuracy         | 5.42    | 20
Grounded Visual Question Answering         | Grounded GQA enhanced (test) | mIoU             | 59.09   | 16
Multimodal Perception Evaluation           | MME                          | Perception Score | 1.02e+3 | 12
Showing 10 of 11 rows
