CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

About

Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLMs) for the task, they still fail on video inputs given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses, for each object, a corresponding keyframe among all frames in which the object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can be applied to Reasoning Video Instance Segmentation. Its training-free nature further allows an extension to online video streams, where the CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
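
The temporal-semantic idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the numeric `visibility` score, and the `analyze_frame` callable are all hypothetical stand-ins for the MLLM chain-of-thought step, which the abstract does not specify. The sketch only shows the bookkeeping of keeping, for each candidate object, the frame in which it is most easily observed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    object_id: str     # hypothetical label for an object that may match the query
    visibility: float  # assumed score in [0, 1]; higher means easier to observe


# Hypothetical stand-in for the MLLM reasoning step: given a frame index and
# the text query, return the candidate objects visible in that frame.
FrameAnalyzer = Callable[[int, str], List[Candidate]]


def select_keyframes(num_frames: int, query: str,
                     analyze_frame: FrameAnalyzer) -> Dict[str, int]:
    """Return, for each candidate object, the index of its best-visibility frame."""
    best: Dict[str, Tuple[int, float]] = {}
    for t in range(num_frames):
        for cand in analyze_frame(t, query):        # semantic step (assumed)
            prev = best.get(cand.object_id)
            # Temporal step (assumed): keep the frame with the highest score;
            # ties keep the earliest frame.
            if prev is None or cand.visibility > prev[1]:
                best[cand.object_id] = (t, cand.visibility)
    return {obj: frame for obj, (frame, _) in best.items()}


# Toy analyzer with hard-coded answers, used only to exercise the sketch.
def toy_analyzer(frame_index: int, query: str) -> List[Candidate]:
    table = {
        0: [Candidate("dog", 0.3)],
        1: [Candidate("dog", 0.9), Candidate("cat", 0.4)],
        2: [Candidate("cat", 0.8)],
    }
    return table.get(frame_index, [])


keyframes = select_keyframes(3, "which animal is chasing the ball?", toy_analyzer)
print(keyframes)
```

In a real pipeline the selected keyframes would presumably be passed to a segmentation and mask-propagation stage; that stage is outside what the abstract describes and is omitted here.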

Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang • 2025

Related benchmarks

Task	Dataset	Result
Referring Video Segmentation	MeViS	J&F Score 52.2	101
Reasoning Video Object Segmentation	ReasonVOS	J&F Score 65.5	43
Video Object Segmentation	ReVOS Overall	J&F Score 55.9	24
Video Object Segmentation	ReasonVOS	J&F Score 50.7	21
Referring and Reasoning Video Object Segmentation	ReVOS	Overall J&F Score 55.9	16
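
For readers unfamiliar with the metric in the table, J&F is conventionally the mean of region similarity J (the intersection-over-union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below is a simplified illustration of that convention, not the evaluation code behind the numbers above: masks are plain nested lists of 0/1, F is taken as a precomputed input because boundary matching is more involved, and returning 1.0 for two empty masks is an assumed convention.

```python
from typing import List

Mask = List[List[int]]  # binary mask as nested lists of 0/1 (assumed format)


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [0, 0]]
j = region_similarity(pred, gt)   # 1 overlapping pixel out of 2 in the union
score = j_and_f(j, 0.75)          # 0.75 is a made-up F value for illustration
print(j, score)
```

Benchmark scores such as those in the table are typically reported as percentages averaged over all frames and objects in a dataset.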

