Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

About

This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.

Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang• 2025

Related benchmarks

Task	Dataset	Result
Multiple Object Tracking	MOT20 (test)	--	458
Reasoning Segmentation	ReasonSeg (val)	gIoU65.2	327
Diagram Understanding	AI2D	Accuracy82.1	317
Referring Expression Segmentation	RefCOCO (testA)	cIoU84.2	315
Referring Expression Segmentation	RefCOCO+ (testA)	cIoU81.2	288
Science Question Answering	ScienceQA (test)	Average Accuracy96.8	273
Referring Expression Segmentation	RefCOCO+ (val)	cIoU77.6	272
Referring Expression Segmentation	RefCOCO (val)	cIoU82.4	261
Referring Expression Segmentation	RefCOCO (testB)	cIoU79.5	259
Referring Expression Segmentation	RefCOCO+ (testB)	cIoU73.1	256

Showing 10 of 107 rows

...

Other info

Follow for update

@wizwand_team Discord