Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

About

In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio- and language-referenced video object segmentation, namely the AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, but this is less robust to spatiotemporal variations because it does not exploit video context. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module that instructs GPT-4 to perform two-step temporal-spatial reasoning, sequentially selecting a pivot frame and then a pivot box, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of the video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of the AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performance comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
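The two-step selection described above can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: every name here (`PivotPrompt`, `select_pivot`, `segment_video`, and the injected callables `pick_frame`, `propose_boxes`, `pick_box`, `propagate`) is hypothetical, and the step that proposes candidate boxes on the pivot frame is an assumption that the abstract does not spell out. The reference is assumed to be text already; for AVS it would be the language-formatted output of an audio-to-text step.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Axis-aligned box as (x1, y1, x2, y2); the layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class PivotPrompt:
    """Initial object prompt handed to the video segmenter (hypothetical type)."""
    frame_index: int
    box: Box


def select_pivot(
    frames: Sequence[object],
    reference: str,
    pick_frame: Callable[[Sequence[object], str], int],
    propose_boxes: Callable[[object, str], List[Box]],
    pick_box: Callable[[object, List[Box], str], int],
) -> PivotPrompt:
    """Two-step selection: a pivot frame first (temporal), then a pivot box (spatial)."""
    if not frames:
        raise ValueError("frames must not be empty")
    # Step 1: choose the frame in which the referred object is clearest.
    frame_index = pick_frame(frames, reference)
    pivot_frame = frames[frame_index]
    # Assumed intermediate step: obtain candidate boxes on the pivot frame.
    candidates = propose_boxes(pivot_frame, reference)
    if not candidates:
        raise ValueError("no candidate boxes on the pivot frame")
    # Step 2: choose the candidate box that matches the reference.
    box_index = pick_box(pivot_frame, candidates, reference)
    return PivotPrompt(frame_index, candidates[box_index])


def segment_video(frames, reference, pick_frame, propose_boxes, pick_box, propagate):
    """Select the pivot prompt, then let a video segmenter propagate it."""
    prompt = select_pivot(frames, reference, pick_frame, propose_boxes, pick_box)
    # `propagate` stands in for a SAM 2-style tracker; its signature is assumed.
    return propagate(frames, prompt)
```

Passing the selectors in as callables keeps the sketch independent of any particular model API: in practice `pick_frame` and `pick_box` would wrap prompted calls to a vision-language model, and `propagate` would wrap the video segmenter.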

Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu • 2024

Related benchmarks

Task	Dataset	Result
Referring Video Segmentation	Ref-YouTube-VOS	J&F Score 67.9	108
Referring Video Segmentation	MeViS	J&F Score 42.8	101
Referring Video Object Segmentation	Ref-DAVIS	J&F Score 74.2	59
Referring Video Object Segmentation	YoURVOS (test)	J&F 26.3	40
Audio-Visual Segmentation	AVSBench single-source V1	MJ Score 70.5	13
Audio-Visual Segmentation	AVSBench multiple-source V1	MJ 48.6	13
Audio-Visual Segmentation	S4	mIoU (MJ) 70.5	11
Audio-Visual Segmentation	MS3	mIoU (MJ) 48.6	11
Audio-Visual Segmentation	AVSS V2	MJ Score 59.2	9
Audio-Visual Segmentation	AVSBench binary V2	MJ Score 59.2	8

Showing 10 of 13 rows
