Object-centric Learning with Cyclic Walks between Parts and Whole

About

Learning object-centric representations from complex natural environments equips both humans and machines with the ability to reason from low-level perceptual features. To capture the compositional entities of a scene, we propose cyclic walks between perceptual features extracted from vision transformers and object entities. First, a slot-attention module interfaces with these perceptual features and produces a finite set of slot representations. These slots can bind to any object entities in the scene via inter-slot competition for attention. Next, we establish entity-feature correspondence with cyclic walks along high transition probabilities based on the pairwise similarity between perceptual features (aka "parts") and slot-bound object representations (aka "whole"). The whole is greater than its parts and the parts constitute the whole. The part-whole interactions form cycle consistencies that serve as supervisory signals to train the slot-attention module. Our rigorous experiments on seven image datasets in three unsupervised tasks demonstrate that networks trained with our cyclic walks can disentangle foregrounds and backgrounds, discover objects, and segment semantic objects in complex scenes. In contrast to object-centric models that attach a decoder for pixel-level or feature-level reconstruction, our cyclic walks provide strong learning signals while avoiding computational overhead and improving memory efficiency. Our source code and data are available at: https://github.com/ZhangLab-DeepNeuroCogLab/Parts-Whole-Object-Centric-Learning/
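The cyclic walk described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine-similarity choice, the `temperature` value, and the identity-target loss on the whole-parts-whole walk are all assumptions made for illustration. Random arrays stand in for vision-transformer features and slot-attention outputs.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(v):
    # Unit-normalize each row so dot products become cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cyclic_walk(features, slots, temperature=0.1):
    """Round-trip transition matrices between parts and whole.

    features: (N, D) perceptual features ("parts").
    slots:    (K, D) slot representations ("whole").
    temperature is a hypothetical hyperparameter, not taken from the source.
    """
    x = l2_normalize(features)
    s = l2_normalize(slots)
    sim = x @ s.T / temperature                # (N, K) pairwise similarity
    parts_to_whole = softmax(sim, axis=1)      # each part walks to a slot
    whole_to_parts = softmax(sim.T, axis=1)    # each slot walks to a part
    whole_cycle = whole_to_parts @ parts_to_whole  # (K, K) slot -> part -> slot
    parts_cycle = parts_to_whole @ whole_to_parts  # (N, N) part -> slot -> part
    return whole_cycle, parts_cycle

def whole_cycle_loss(whole_cycle, eps=1e-8):
    # Assumed objective: a walk that starts at a slot should return to the
    # same slot, i.e. cross-entropy against the identity matrix.
    return float(-np.mean(np.log(np.diag(whole_cycle) + eps)))

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))  # stand-in for ViT patch features
slots = rng.normal(size=(4, 8))      # stand-in for slot-attention outputs
whole_cycle, parts_cycle = cyclic_walk(features, slots)
loss = whole_cycle_loss(whole_cycle)
```

Because both transition matrices are row-stochastic, each round-trip matrix is also row-stochastic, so its diagonal can be read as the probability that a walk returns to its starting point. No decoder is involved in this sketch, which matches the abstract's point that the training signal comes from cycle consistency rather than reconstruction.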

Ziyu Wang, Mike Zheng Shou, Mengmi Zhang • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	PASCAL VOC 2012	mIoU 43.3	218
Semantic segmentation	COCO-Stuff 27	mIoU 22.5	67
Object Discovery	PASCAL VOC 2012 (val)	--	14
Object Discovery	MOVi-C (val)	FG-ARI 67.6	7
Object Discovery	COCO 2017 (val)	FG-ARI 39.7	6
Unsupervised Foreground Extraction	CUB200 Birds (test)	mIoU 72.4	5
Unsupervised Foreground Extraction	Stanford Dogs (test)	mIoU 86.2	5
Unsupervised Foreground Extraction	Stanford Cars (test)	mIoU 0.902	5
Unsupervised Foreground Extraction	Flowers (test)	mIoU 75.1	5
Unsupervised Object Discovery	CLEVRTex (test)	FG-ARI 67.4	5

