
Vector-Quantized Vision Foundation Models for Object-Centric Learning

About

Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. Its self-supervision, reconstructing the input from slots, struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as both the aggregation input and the reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simple: shared quantization of VFM representations in both OCL aggregation and decoding. Experiments show that across different VFMs, aggregators, and decoders, VVO consistently outperforms baselines in object discovery and recognition, as well as in downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.
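The mechanism described above can be sketched in a few lines of PyTorch. The following is a minimal, hypothetical sketch, not the authors' implementation: a frozen VFM encodes the image into patch features, a learned codebook vector-quantizes them with a straight-through estimator, and the same quantized features both feed the slot aggregator and serve as the decoder's reconstruction target. The module names (`vfm`, `aggregator`, `decoder`) and all shapes and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VVOSketch(nn.Module):
    """Sketch of the VVO idea: quantized VFM features are shared
    between aggregation (input) and decoding (reconstruction target)."""

    def __init__(self, vfm, aggregator, decoder, num_codes=4096, dim=768):
        super().__init__()
        self.vfm = vfm                    # frozen VFM, e.g. a ViT backbone (assumed interface)
        self.codebook = nn.Embedding(num_codes, dim)
        self.aggregator = aggregator      # e.g. Slot Attention: (B, N, D) -> (B, S, D)
        self.decoder = decoder            # reconstructs features from slots: (B, S, D) -> (B, N, D)

    def quantize(self, feats):
        # standard VQ step: replace each feature with its nearest codebook entry
        d = torch.cdist(feats, self.codebook.weight)      # (B*N, num_codes)
        idx = d.argmin(dim=-1)
        quant = self.codebook(idx)
        # straight-through estimator so gradients pass through the quantization
        return feats + (quant - feats).detach()

    def forward(self, images):
        with torch.no_grad():
            feats = self.vfm(images)      # (B, N, D) patch features, VFM kept frozen
        b, n, d = feats.shape
        quant = self.quantize(feats.reshape(b * n, d)).reshape(b, n, d)
        slots = self.aggregator(quant)    # object-level slot vectors
        recon = self.decoder(slots)       # (B, N, D)
        # the shared quantized features double as the reconstruction target
        loss = F.mse_loss(recon, quant.detach())
        return slots, loss
```

Under this reading of the abstract, the shared quantization is the unifying step: because aggregation and decoding operate on the same discretized VFM features, the reconstruction loss supervises the slots directly in the VFM feature space.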

Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | - | - | 935 |
| Visual Question Answering | GQA | Accuracy | 56.2 | 374 |
| Multimodal Evaluation | MM-Vet | Accuracy | 18.8 | 122 |
| Counterfactual reasoning | CVQA | Accuracy | 66.37 | 40 |
| Multi-modal Perception Evaluation | MME Perception | Perception Score | 1230 | 31 |
| Robustness to Natural Adversarial Examples | NaturalBench | Accuracy | 6.07 | 20 |
| Vision-Language Compositionality | SugarCrepe | Accuracy | 80.16 | 20 |
| OOD Generalization | OODCV | Accuracy | 53.18 | 20 |
| Grounded Visual Question Answering | Grounded GQA enhanced (test) | mIoU | 48.5 | 16 |
