SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers
About
Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within this framework, two aspects are crucial: guiding the encoder to generate object-specific slots and ensuring the decoder actually utilizes them during reconstruction. This work introduces two novel techniques: (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) a patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. We demonstrate the effectiveness of both strategies experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially on complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot .
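The two techniques above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the function names, tensor shapes, and the cross-entropy form of the distillation loss are illustrative assumptions:

```python
import torch


def permute_patches(patches: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    """Reorder the target patch sequence before autoregressive decoding.

    Predicting patches in a permuted order weakens the shortcut of
    inferring a patch from its raster-order neighbours, pushing the
    decoder to rely on the slot vectors instead. (Illustrative sketch.)
    """
    # patches: (batch, num_patches, dim); perm: (num_patches,)
    return patches[:, perm, :]


def attention_distillation_loss(enc_attn: torch.Tensor,
                                dec_attn: torch.Tensor) -> torch.Tensor:
    """Distill decoder slot-attention masks into the encoder.

    enc_attn, dec_attn: (batch, num_patches, num_slots), rows are
    per-patch distributions over slots. The decoder masks act as the
    teacher, so they are detached and gradients flow only into the
    encoder. Cross-entropy between the two distributions is one
    plausible choice of distillation objective (an assumption here).
    """
    target = dec_attn.detach()
    return -(target * torch.log(enc_attn.clamp_min(1e-8))).sum(-1).mean()
```

In training, the permutation would be resampled (or cycled over a fixed set of orders) per batch, and the distillation term added to the reconstruction loss with a weighting coefficient.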
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 79.69 | 935 |
| Visual Question Answering | GQA | Accuracy | 54.94 | 374 |
| Multimodal Evaluation | MM-Vet | Accuracy | 17.8 | 122 |
| Counterfactual Reasoning | CVQA | Accuracy | 69.47 | 40 |
| Multi-modal Perception Evaluation | MME Perception | Perception Score | 1170 | 31 |
| Unsupervised Object Segmentation | COCO | mBOi | 35 | 26 |
| OOD Generalization | OODCV | Accuracy | 54.07 | 20 |
| Vision-Language Compositionality | SugarCrepe | Accuracy | 74.08 | 20 |
| Robustness to Natural Adversarial Examples | NaturalBench | Accuracy | 3.68 | 20 |
| Semantic-level Object Discovery | VOC | mIoU | 55.3 | 19 |