Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

About

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via an attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks.
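The pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The rollout function follows the generic attention rollout recipe (head averaging, an identity term for the residual path, and a product across layers). The fusion step is an assumption: the abstract does not give the DecAF formula, so the sketch simply subtracts a background map from an object map and clips at zero. The function names, the 0.5 threshold, and the use of the peak location as a point prompt are all hypothetical, and no SAM2 call is made.

```python
import numpy as np

def attention_rollout(layer_attns):
    """Generic attention rollout over a list of (heads, tokens, tokens) arrays.

    Each layer is head-averaged, mixed with the identity to account for the
    residual connection, re-normalized per row, and multiplied into the
    running product.
    """
    n = layer_attns[0].shape[-1]
    rollout = np.eye(n)
    for attn in layer_attns:
        a = attn.mean(axis=0)                    # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)            # residual path
        a = a / a.sum(axis=-1, keepdims=True)    # keep rows summing to 1
        rollout = a @ rollout
    return rollout

def normalize(m):
    """Min-max normalize a map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m, dtype=float)

def contrastive_fusion(obj_map, bg_map):
    """ASSUMED stand-in for object-background fusion: subtract and clip."""
    diff = np.clip(normalize(obj_map) - normalize(bg_map), 0.0, None)
    return normalize(diff)

def coarse_mask(attn_map, thresh=0.5):
    """Binarize a [0, 1] map; the threshold value is a hypothetical choice."""
    return attn_map >= thresh

def peak_point(attn_map):
    """Return the (x, y) location of the strongest activation.

    A point like this could serve as a prompt for a promptable segmenter;
    how DecAF actually builds its SAM2 prompts is not stated in the abstract.
    """
    y, x = np.unravel_index(np.argmax(attn_map), attn_map.shape)
    return int(x), int(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obj = rng.random((4, 6))   # toy attention map for an object query
    bg = rng.random((4, 6))    # toy attention map for a background query
    fused = contrastive_fusion(obj, bg)
    print(coarse_mask(fused).astype(int))
    print(peak_point(fused))
```

In a real setting the object and background maps would come from the rows of the rollout matrix that link query tokens to visual tokens, reshaped to the frame's patch grid; that indexing depends on the specific MLLM and is left out here.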

Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim • 2025

Related benchmarks

Task	Dataset	Metric	Result
Referring Video Object Segmentation	Ref-YouTube-VOS	J&F	59.9	143
Referring Video Segmentation	MeViS	J&F Score	48.1	101
Reasoning Video Object Segmentation	ReVOS Reasoning	J&F Score	49.7	75
Referring Video Object Segmentation	Ref-DAVIS	J&F Score	75.2	59
Video Referring Segmentation	ReVOS Referring	J&F Score	58.7	51
Video Reasoning Segmentation	ReVOS Referring	J&F Score	58.7	49
Video Reasoning Segmentation	ReVOS Overall	J&F Score	54.2	49
Reasoning Video Object Segmentation	ReasonVOS	J&F Score	63.9	43
Video Object Segmentation	ReVOS Overall	J&F Score	54.2	24
Video Object Segmentation	ReasonVOS	J&F Score	63.9	21

Showing 10 of 13 rows
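For readers unfamiliar with the metric in the table, J&F is conventionally the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below computes J for one frame and averages it with an F value supplied by the caller. Computing F requires boundary matching, which is omitted here, and the function names are illustrative rather than taken from any benchmark toolkit.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union between two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match in this sketch.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return 0.5 * (j + f)

if __name__ == "__main__":
    pred = np.array([[1, 1], [0, 0]])
    gt = np.array([[1, 0], [0, 0]])
    j = region_similarity(pred, gt)
    print(j)                 # IoU for this toy frame
    print(j_and_f(j, 0.7))   # 0.7 is a made-up F value for illustration
```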
