Zero-Shot Object-Centric Representation Learning
About
The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.
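The abstract does not spell out the grouping mechanism, but object-centric methods built on pre-trained self-supervised features typically group frozen patch embeddings into slots via slot-attention-style iterative clustering. Below is a minimal numpy sketch of that idea, heavily simplified (no learned projections, layer norms, or GRU update, and random features stand in for a real encoder); all names here are illustrative, not the paper's implementation.

```python
import numpy as np

def slot_attention(features, num_slots=4, iters=3, seed=0):
    """Group patch features into `num_slots` slots by iterative attention.

    features: (N, D) array of patch embeddings from a frozen encoder
              (here just random numbers; a real pipeline would use e.g.
              self-supervised ViT features).
    Returns (slots, attn): slot vectors (K, D) and the soft assignment
    of each patch to each slot (N, K).
    """
    rng = np.random.default_rng(seed)
    N, D = features.shape
    slots = rng.normal(size=(num_slots, D))  # random slot initialization
    for _ in range(iters):
        # Softmax over slots for each patch: slots compete for patches.
        logits = features @ slots.T / np.sqrt(D)            # (N, K)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        # Update each slot as the weighted mean of its assigned patches.
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # (N, K)
        slots = w.T @ features                               # (K, D)
    return slots, attn

# Toy usage: 16 "patches" with 8-dim features grouped into 3 slots.
feats = np.random.default_rng(1).normal(size=(16, 8))
slots, attn = slot_attention(feats, num_slots=3)
```

The softmax over the slot axis (rather than the patch axis) is what makes the slots partition the scene: each patch distributes its attention mass across slots, so slots specialize to disjoint regions.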
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 81.54 | 935 |
| Visual Question Answering | GQA | Accuracy | 56.15 | 374 |
| Multimodal Evaluation | MM-Vet | Accuracy | 17.2 | 122 |
| Counterfactual Reasoning | CVQA | Accuracy | 68.85 | 40 |
| Multimodal Perception Evaluation | MME Perception | Perception Score | 1.24e+3 | 31 |
| Vision-Language Compositionality | SugarCrepe | Accuracy | 81.24 | 20 |
| OOD Generalization | OODCV | Accuracy | 55.18 | 20 |
| Robustness to Natural Adversarial Examples | NaturalBench | Accuracy | 5.42 | 20 |
| Grounded Visual Question Answering | Grounded GQA enhanced (test) | mIoU | 59.09 | 16 |
| Multimodal Perception Evaluation | MME | Perception Score | 1.02e+3 | 12 |