Evaluating Object-Centric Models beyond Object Discovery
About
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are rarely evaluated against these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations of existing benchmarks: (1) they provide limited insight into the usefulness of OCL representations, and (2) localization and representation usefulness are assessed with disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets and measuring how well VLMs can leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), eliminating the inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
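To make the joint where/what idea concrete, below is a minimal sketch of one way such a metric could be computed. The gating rule, the IoU threshold, and all names (`joint_where_what_score`, `probe_correct`, etc.) are illustrative assumptions, not the paper's actual definition: it credits a discovered object only when it is both well localized and its slot representation lets a probe predict the object's property correctly.

```python
# Hypothetical sketch of a unified "where + what" score for one image.
# Assumption (not from the paper): an object's localization score (mask IoU)
# counts only if it clears a threshold AND a probe on the object's slot
# representation predicted its property (e.g., class or color) correctly.
import numpy as np


def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def joint_where_what_score(objects: list[dict], iou_threshold: float = 0.5) -> float:
    """Average per-object score: localization IoU gated by probe correctness.

    Each entry in `objects` is a dict with keys:
      'pred_mask', 'gt_mask' -- boolean arrays for one discovered object
      'probe_correct'        -- bool, whether a probe on the object's slot
                                representation answered correctly
    """
    scores = []
    for obj in objects:
        loc = iou(obj["pred_mask"], obj["gt_mask"])
        # Credit localization only when the representation was also useful.
        good = loc >= iou_threshold and obj["probe_correct"]
        scores.append(loc if good else 0.0)
    return float(np.mean(scores)) if scores else 0.0
```

Gating the two terms rather than averaging them means a model cannot score well by localizing objects whose representations are uninformative, or by encoding useful features without localizing them, which is the inconsistency that disjoint metrics allow.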
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 82.2 | 935 |
| Visual Question Answering | GQA | Accuracy | 58.28 | 374 |
| Multimodal Evaluation | MM-Vet | Accuracy | 19.3 | 122 |
| Counterfactual Reasoning | CVQA | Accuracy | 66.64 | 40 |
| Multimodal Perception Evaluation | MME Perception | Perception Score | 1.28e+3 | 31 |
| Vision-Language Compositionality | SugarCrepe | Accuracy | 83.17 | 20 |
| OOD Generalization | OODCV | Accuracy | 57.31 | 20 |
| Robustness to Natural Adversarial Examples | NaturalBench | Accuracy | 6.84 | 20 |
| Grounded Visual Question Answering | Grounded GQA enhanced (test) | mIoU | 56.92 | 16 |
| Multimodal Perception Evaluation | MME | Perception Score | 1.18e+3 | 12 |