Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
About
In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely the AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, but it is less robust to spatiotemporal variations because it does not explore video context. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module that instructs GPT-4 to perform two-step temporal-spatial reasoning, sequentially selecting pivot frames and pivot boxes and thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of the AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performance comparable to or even better than fully supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
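The two-step selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `PivotPrompt`, `select_pivot`, and the three callables (`pick_frame`, `propose_boxes`, `pick_box`) are hypothetical names introduced here, standing in for the GPT-based temporal choice, a box proposer, and the GPT-based spatial choice. It also assumes the reference has already been turned into a language string (the role the abstract assigns to LBRU for audio) and that candidate boxes come from some detector.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class PivotPrompt:
    """Initial object prompt handed to the video segmenter (hypothetical type)."""
    frame_index: int
    box: Box


def select_pivot(
    frames: Sequence[object],
    reference: str,
    pick_frame: Callable[[Sequence[object], str], int],
    propose_boxes: Callable[[object, str], List[Box]],
    pick_box: Callable[[object, List[Box], str], int],
) -> PivotPrompt:
    """Two-step temporal-then-spatial selection (illustrative only)."""
    if not frames:
        raise ValueError("no frames given")

    # Step 1 (temporal): choose the pivot frame using the whole clip as context.
    frame_index = pick_frame(frames, reference)
    if not 0 <= frame_index < len(frames):
        raise ValueError(f"pivot frame index {frame_index} out of range")
    pivot_frame = frames[frame_index]

    # Step 2 (spatial): choose one box among the candidates on that frame.
    candidates = propose_boxes(pivot_frame, reference)
    if not candidates:
        raise ValueError("no candidate boxes on the pivot frame")
    box_index = pick_box(pivot_frame, candidates, reference)
    if not 0 <= box_index < len(candidates):
        raise ValueError(f"pivot box index {box_index} out of range")

    return PivotPrompt(frame_index=frame_index, box=candidates[box_index])


if __name__ == "__main__":
    # Stub selectors stand in for the model calls, which are out of scope here.
    demo = select_pivot(
        frames=["frame0", "frame1", "frame2"],
        reference="the dog on the left",
        pick_frame=lambda fs, ref: 1,
        propose_boxes=lambda f, ref: [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 40.0, 40.0)],
        pick_box=lambda f, boxes, ref: 1,
    )
    print(demo)
```

Passing the selectors in as callables keeps the sketch independent of any particular model API; in a real pipeline the returned frame index and box would serve as the initial prompt from which the segmenter propagates masks through the video.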
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Referring Video Segmentation | Ref-YouTube-VOS | J&F Score | 67.9 | 108 |
| Referring Video Segmentation | MeViS | J&F Score | 42.8 | 81 |
| Referring Video Object Segmentation | YoURVOS (test) | J&F | 26.3 | 40 |
| Referring Video Object Segmentation | Ref-DAVIS | J&F Score | 74.2 | 21 |
| Audio-Visual Segmentation | S4 | mIoU (MJ) | 70.5 | 11 |
| Audio-Visual Segmentation | MS3 | mIoU (MJ) | 48.6 | 11 |
| Audio-Visual Segmentation | AVSS V2 | MJ Score | 59.2 | 9 |
| Audio-Visual Segmentation | AVSS Binary | mIoU (MJ) | 59.2 | 6 |
| Audio-referred visual grounding | AVISeg (test) | FSLA | 18.55 | 4 |