Evaluating Object-Centric Models beyond Object Discovery
About
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are rarely evaluated against these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations of existing benchmarks: (1) they provide limited insight into the usefulness of OCL representations, and (2) localization and representation usefulness are assessed with disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets and measuring how well VLMs can leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), eliminating the inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
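To make the joint where/what idea concrete, below is a minimal sketch of one way such a metric could be computed. The gating rule, the IoU threshold, and all names (`joint_where_what_score`, `probe_correct`, etc.) are illustrative assumptions, not the paper's actual definition: it credits a discovered object only when it is both well localized and its slot representation lets a probe predict the object's property correctly.

```python
# Hypothetical sketch of a unified "where + what" score for one image.
# Assumption (not from the paper): an object's localization score (mask IoU)
# counts only if it clears a threshold AND a probe on the object's slot
# representation predicted its property (e.g., class or color) correctly.
import numpy as np


def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def joint_where_what_score(objects: list[dict], iou_threshold: float = 0.5) -> float:
    """Average per-object score: localization IoU gated by probe correctness.

    Each entry in `objects` is a dict with keys:
      'pred_mask', 'gt_mask' -- boolean arrays for one discovered object
      'probe_correct'        -- bool, whether a probe on the object's slot
                                representation answered correctly
    """
    scores = []
    for obj in objects:
        loc = iou(obj["pred_mask"], obj["gt_mask"])
        # Credit localization only when the representation was also useful.
        good = loc >= iou_threshold and obj["probe_correct"]
        scores.append(loc if good else 0.0)
    return float(np.mean(scores)) if scores else 0.0
```

Gating the two terms rather than averaging them means a model cannot score well by localizing objects whose representations are uninformative, or by encoding useful features without localizing them, which is the inconsistency that disjoint metrics allow.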
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 82.2 | 935 |
| Visual Question Answering | GQA | Accuracy | 58.28 | 374 |
| Multimodal Evaluation | MM-Vet | Accuracy | 19.3 | 122 |
| Counterfactual Reasoning | CVQA | Accuracy | 66.64 | 40 |
| Multimodal Perception Evaluation | MME Perception | Perception Score | 1.28e+3 | 31 |
| Vision-Language Compositionality | SugarCrepe | Accuracy | 83.17 | 20 |
| OOD Generalization | OODCV | Accuracy | 57.31 | 20 |
| Robustness to Natural Adversarial Examples | NaturalBench | Accuracy | 6.84 | 20 |
| Grounded Visual Question Answering | Grounded GQA enhanced (test) | mIoU | 56.92 | 16 |
| Multimodal Perception Evaluation | MME | Perception Score | 1.18e+3 | 12 |