
Evaluating Object-Centric Models beyond Object Discovery

About

Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated with respect to these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations of existing benchmarks: (1) they provide limited insight into the usefulness of OCL representations, and (2) localization and representation usefulness are assessed with disjoint metrics. To address (1), we use instruction-tuned vision-language models (VLMs) as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating the inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
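To make the "joint where-and-what" idea concrete, here is a minimal sketch (not the paper's actual metric; the scoring rule, function names, and the 0.5 IoU threshold are illustrative assumptions): an object counts as a hit only if it is both well localized (mask IoU above a threshold) and its representation supports a correct prediction.

```python
# Hedged sketch of a unified localization ("where") + representation-
# usefulness ("what") score. All names and thresholds are illustrative.

def iou(pred_mask, gt_mask):
    """Intersection-over-union of two binary masks given as 0/1 lists."""
    inter = sum(p & g for p, g in zip(pred_mask, gt_mask))
    union = sum(p | g for p, g in zip(pred_mask, gt_mask))
    return inter / union if union else 0.0

def joint_score(objects, iou_threshold=0.5):
    """Each object is (pred_mask, gt_mask, prediction_correct).

    An object scores only if it is localized (IoU >= threshold) AND its
    representation yielded a correct downstream prediction, so neither
    localization nor representation quality alone is enough.
    """
    if not objects:
        return 0.0
    hits = sum(
        1 for pred_mask, gt_mask, correct in objects
        if iou(pred_mask, gt_mask) >= iou_threshold and correct
    )
    return hits / len(objects)
```

A disjoint evaluation could rate a model highly on mIoU and separately on answer accuracy even if the well-localized objects and the correctly answered ones are different objects; coupling the two conditions per object avoids that inconsistency.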

Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy | 82.2 | 935 |
| Visual Question Answering | GQA | Accuracy | 58.28 | 374 |
| Multimodal Evaluation | MM-Vet | Accuracy | 19.3 | 122 |
| Counterfactual Reasoning | CVQA | Accuracy | 66.64 | 40 |
| Multimodal Perception Evaluation | MME Perception | Perception Score | 1280 | 31 |
| Vision-Language Compositionality | SugarCrepe | Accuracy | 83.17 | 20 |
| OOD Generalization | OODCV | Accuracy | 57.31 | 20 |
| Robustness to Natural Adversarial Examples | NaturalBench | Accuracy | 6.84 | 20 |
| Grounded Visual Question Answering | Grounded GQA enhanced (test) | mIoU | 56.92 | 16 |
| Multimodal Perception Evaluation | MME | Perception Score | 1180 | 12 |

Showing 10 of 11 rows.
