CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

About

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
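To make the idea of "conditioning slots on language descriptions" concrete, here is a minimal sketch of one way it could work. It is an illustration under stated assumptions, not the authors' implementation: it uses a stripped-down slot-attention loop with no learned projections, GRU, or MLP, and it assumes the language query embeddings already live in the same space as the patch features. The function name `language_conditioned_slots` and all of its parameters are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def language_conditioned_slots(features, query_embeddings, num_slots, iters=3, seed=0):
    """Simplified slot attention with language-conditioned initialization.

    features: list of N patch vectors of dimension D.
    query_embeddings: list of Q <= num_slots vectors of dimension D
        (assumed to be text embeddings already projected to dimension D).
    Returns (slots, attn), where attn[n][k] is the share of patch n
    assigned to slot k in the last iteration.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    # Conditioned slots start from the query embeddings;
    # any remaining slots start from random noise.
    slots = [list(q) for q in query_embeddings]
    slots += [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(num_slots - len(slots))]
    scale = 1 / math.sqrt(dim)
    attn = []
    for _ in range(iters):
        # Softmax over slots for each patch: slots compete for patches.
        attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
        new_slots = []
        for k in range(num_slots):
            weights = [row[k] for row in attn]
            total = sum(weights) + 1e-8
            # Each slot becomes the attention-weighted mean of the patches.
            new_slots.append([
                sum(w * f[d] for w, f in zip(weights, features)) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn
```

In this sketch, the slot at index `k` is tied to query `k` only through its initialization. A real system would need additional training signal to keep that binding stable; how CTRL-O does this is not shown here.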

Aniket Didolkar, Andrii Zadaianchuk, Rabiul Awal, Maximilian Seitzer, Efstratios Gavves, Aishwarya Agrawal • 2025

Related benchmarks

Task	Dataset	Result
Referring Expression Segmentation	RefCOCO (testA)	--	332
Referring Expression Segmentation	RefCOCO+ (testA)	--	305
Referring Expression Segmentation	RefCOCO+ (val)	--	284
Referring Expression Segmentation	RefCOCO (val)	--	273
Referring Expression Segmentation	RefCOCO (testB)	--	259
Referring Expression Segmentation	RefCOCO+ (testB)	--	256
Visual Question Answering	VQA v2 (val)	Accuracy 60.25	158
Object Discovery	COCO (val)	FG-ARI 47.5	11
Referring Expression Segmentation	Gref (val)	mIoU 30.5	7
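For readers unfamiliar with the mIoU metric in the table, the sketch below shows the common definition: the intersection-over-union between each predicted and ground-truth binary mask, averaged over all referring expressions. Whether the reported number uses exactly this per-expression averaging is an assumption. The helper names `mask_iou` and `mean_iou` are hypothetical.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks have no union; treat that case as IoU 0 (an assumption).
    return inter / union if union else 0.0


def mean_iou(pred_masks, target_masks):
    """Average the per-sample IoU over all (prediction, ground truth) pairs."""
    ious = [mask_iou(p, t) for p, t in zip(pred_masks, target_masks)]
    return sum(ious) / len(ious)
```

A value such as 30.5 in the table would correspond to `100 * mean_iou(...)` under this definition.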
