Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory

About

Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences from implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demands substantial resources. In addition, some existing methods couple the processing of spatial and temporal information, which can reduce the temporal stability of the model. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). Our goal is a training-free reasoning video segmentation framework that uses only pre-trained models yet outperforms existing methods that require fine-tuning. We also propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation: in the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key-object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets: Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.

Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, Fanzhang Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 65.3 | 244 |
| Referring Video Object Segmentation | Ref-DAVIS 2017 (val) | J&F | 76 | 230 |
| Referring Video Object Segmentation | MeViS (val) | J&F Score | 0.486 | 166 |
| Referring Video Object Segmentation | Ref-DAVIS | J&F Score | 76 | 59 |
| Reasoning Video Object Segmentation | ReasonVOS | J&F Score | 55.1 | 43 |
| Referring Video Object Segmentation | Ref-YTVOS | J&F Score | 65.3 | 21 |
| Referring Video Object Segmentation | MeViS v1 | J&F Score | 48.6 | 19 |
| Referring Video Object Segmentation | ReVOS | J&F Score | 58 | 15 |
| Referring Video Object Segmentation | ReasonVOS | J&F Score | 55.1 | 12 |
| Reasoning Video Object Segmentation | ReVOS (val) | Referring J&F Score | 61.5 | 11 |
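
The J&F figures listed above are the standard video object segmentation score: the average of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below is a minimal pure-Python illustration of how the two are combined, not the evaluation code behind the listed results. The function names `region_similarity` and `j_and_f` are hypothetical. Computing F requires boundary matching, which is omitted here, so per-frame F values are assumed to be supplied by an existing evaluation toolkit. Scores are on a 0 to 1 scale; benchmark tables usually report them multiplied by 100, which appears to be why MeViS shows up above as both 0.486 and 48.6.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores: Sequence[float], f_scores: Sequence[float]) -> float:
    """J&F: mean of the frame-averaged J and the frame-averaged F."""
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2


# Toy 2x2 example: one overlapping pixel out of three covered pixels.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [1, 0]]
j_frame = region_similarity(pred_mask, gt_mask)

# The F values below are placeholders, not outputs of a boundary metric.
score = j_and_f([j_frame, 0.5], [0.6, 0.8])
```

Keeping J and F as separate per-frame lists mirrors how the two metrics are usually averaged independently before being combined.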