CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

About

Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. Existing works finetune Multimodal Large Language Models (MLLMs) for the task, but they still fail on video inputs when given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses, for each object, a corresponding keyframe among all frames in which the object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can also be applied to Reasoning Video Instance Segmentation. Because the framework is training-free, it further extends to online video streams, where the CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
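
The temporal-semantic reasoning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `FrameAnalyzer` callable, the `KeyframeChoice` type, and the idea of a numeric per-frame visibility score are all assumptions made for illustration. In a real system the analyzer would presumably wrap a chain-of-thought prompt to an MLLM; here it is only a placeholder interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface (an assumption, not from the paper): given a frame
# index and the text query, return {object_name: visibility_score} for the
# objects in that frame that plausibly match the query.
FrameAnalyzer = Callable[[int, str], Dict[str, float]]


@dataclass
class KeyframeChoice:
    object_name: str
    frame_index: int
    score: float


def choose_keyframes(
    num_frames: int,
    query: str,
    analyze: FrameAnalyzer,
    stride: int = 1,
) -> List[KeyframeChoice]:
    """Pick, for each candidate object, the frame where it is most visible."""
    best: Dict[str, KeyframeChoice] = {}
    for idx in range(0, num_frames, stride):
        # Semantic step: which objects in this frame could match the query?
        for name, score in analyze(idx, query).items():
            current = best.get(name)
            # Temporal step: keep the frame with the highest visibility score.
            if current is None or score > current.score:
                best[name] = KeyframeChoice(name, idx, score)
    return sorted(best.values(), key=lambda c: c.frame_index)


# Usage with a stand-in analyzer backed by a fixed lookup table.
fake_scores = {
    0: {"dog": 0.2},
    1: {"dog": 0.9, "cat": 0.4},
    2: {"cat": 0.8},
}
choices = choose_keyframes(3, "the animal being chased",
                           lambda i, q: fake_scores.get(i, {}))
for c in choices:
    print(c.object_name, c.frame_index, c.score)
```

Each chosen keyframe would then be handed to a segmentation and tracking stage to produce the mask sequence; that stage is omitted here because the abstract does not specify it.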

Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Video Segmentation | MeViS | J&F Score | 52.2 | 81 |
| Reasoning Video Object Segmentation | ReasonVOS | J&F Score | 65.5 | 23 |
| Referring and Reasoning Video Object Segmentation | ReVOS | Overall J&F Score | 55.9 | 16 |
| Video Object Segmentation | ReVOS Overall | J&F Score | 55.9 | 10 |
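
For readers unfamiliar with the metric reported above: J&F is conventionally the average of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below computes J from binary masks and combines per-frame J and F values. The function names are hypothetical, the F values are taken as given rather than computed, and treating two empty masks as a perfect match is an assumed convention. The scores in the table appear to be on a 0 to 100 scale, whereas this sketch works in the range 0 to 1.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union between a predicted and a ground-truth mask."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores: Sequence[float], f_scores: Sequence[float]) -> float:
    """Average of mean J and mean F over the evaluated frames."""
    j_mean = sum(j_scores) / len(j_scores)
    f_mean = sum(f_scores) / len(f_scores)
    return (j_mean + f_mean) / 2


# Usage: one frame where the prediction covers the target plus one extra pixel.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_similarity(pred, gt)          # 1 shared pixel / 2 pixels in the union
score = j_and_f([j, 1.0], [0.5, 0.5])    # F values here are made-up inputs
print(j, score)
```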