Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

About

Unsupervised video Object-Centric Learning (OCL) is promising because it enables object-level scene representation and understanding, much as humans perceive scenes. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transforms the current slots into queries for the next frame. This architecture is effective, but all existing implementations both (i1) neglect to incorporate next-frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) we design a new transitioner that incorporates both slots and features, which provides more information for query prediction; (t2) we train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. This superiority also benefits downstream tasks such as scene understanding. Source code, model checkpoints, training logs: https://github.com/Genera1Z/RandSF.Q

Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen • 2025
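
The recurrent aggregator/transitioner loop described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation (see the linked repository for that). Every name here (`aggregate`, `transit`, `sample_pair`, `w_slot`, `w_feat`) is hypothetical, the aggregator is reduced to a single attention step, the random features stand in for encoder outputs, and the pair-sampling rule (slots from step i, features from a later step j) is only one possible reading of "slot-feature pairs randomly sampled from available recurrences".

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(queries, features):
    """Toy aggregator: one slot-attention-style step.

    queries: (S, D), features: (N, D) -> slots: (S, D)
    """
    logits = queries @ features.T / np.sqrt(queries.shape[1])  # (S, N)
    attn = softmax(logits, axis=0)                 # slots compete for each feature
    attn = attn / attn.sum(axis=1, keepdims=True)  # weighted mean over features
    return attn @ features


def transit(slots, next_features, w_slot, w_feat):
    """Toy transitioner using both slots and next-frame features (idea t1)."""
    logits = slots @ next_features.T / np.sqrt(slots.shape[1])  # (S, N)
    context = softmax(logits, axis=1) @ next_features           # (S, D) read-out
    return slots @ w_slot + context @ w_feat


def sample_pair(num_steps, rng):
    """Assumed sampling rule for idea t2: random steps i < j (needs num_steps >= 2)."""
    j = int(rng.integers(1, num_steps))  # feature step, in [1, num_steps - 1]
    i = int(rng.integers(0, j))          # slot step, in [0, j - 1]
    return i, j


# Hypothetical sizes: T frames, N features per frame, S slots, D channels.
T, N, S, D = 4, 16, 3, 8
video = rng.normal(size=(T, N, D))      # stand-in for per-frame encoder features
w_slot = np.eye(D)                      # placeholder "learned" weights
w_feat = 0.1 * rng.normal(size=(D, D))
queries = rng.normal(size=(S, D))       # initial queries

# Recurrent pass: aggregate the frame into slots, then predict the next queries.
slots_per_step = []
for t in range(T):
    slots = aggregate(queries, video[t])
    slots_per_step.append(slots)
    if t + 1 < T:
        queries = transit(slots, video[t + 1], w_slot, w_feat)

# Query prediction from a randomly sampled slot-feature pair.
i, j = sample_pair(T, rng)
pred_queries = transit(slots_per_step[i], video[j], w_slot, w_feat)
```

The abstract does not say how the predicted queries are supervised, so the sketch stops at the prediction and defines no loss.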

Related benchmarks

Task	Dataset	Metric	Value
Object Discovery	MOVi-C	mBOi	29.2	22
Object Recognition	YTVIS-HQ	Top-1 Accuracy (Class)	90.5	11
Object Discovery	YTVIS-HQ	ARI	46	8
Object Discovery	YTVIS 2022	ARI	40.3	8
Object Dynamics Prediction	YouTube-VIS	FG-ARI	46.6	7
Unsupervised Video Object Discovery	MOVi-C conditional (test)	ARI	65.4	7
Unsupervised Video Object Discovery	YTVIS-HQ unconditional (test)	ARI	40.1	7
Unsupervised Video Object Discovery	MOVi-E conditional (test)	ARI	30.5	7
Video Object-Centric Learning	YTVIS	ARI	46	6
Object Discovery	MOVi-D	ARI	41.6	5

Showing 10 of 13 rows
