
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

About

The reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by using either language or annotated masks as references. Despite significant progress in each field, current methods are designed for a single task and developed in different directions, which hinders multi-task capability across these tasks. In this work, we end this fragmentation and propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture. At the heart of our approach is the proposed UniFusion module, which performs multiway fusion to handle different tasks according to their specified references. A unified Transformer architecture is then adopted to achieve instance-level segmentation. With these unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run-time by specifying the corresponding references. We evaluate our unified models on various benchmarks. Extensive experimental results indicate that UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network. Moreover, we show that the proposed UniFusion module can be easily incorporated into the current advanced foundation model SAM and obtains satisfactory results with parameter-efficient finetuning. Code and models are available at https://github.com/FoundationVision/UniRef.
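To make the idea of fusing image features with a task-specific reference more concrete, here is a minimal sketch. It assumes the fusion is a single-head cross-attention with a residual connection; the abstract does not specify the actual UniFusion design, so this should not be read as the authors' implementation. The names `cross_attend` and `fuse`, and the toy feature values, are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys: List[Vector],
                 values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention (assumed, not from the source)."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def fuse(image_feats: List[Vector], ref_feats: List[Vector]) -> List[Vector]:
    """Inject reference information into image features via a residual update."""
    attended = cross_attend(image_feats, ref_feats, ref_feats)
    return [[x + a for x, a in zip(img, att)]
            for img, att in zip(image_feats, attended)]


# Toy features: two image locations, each a 2-d vector.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical stand-ins for the two reference kinds named in the abstract.
text_ref = [[0.5, 0.5], [0.5, 0.5]]  # e.g. encoded language tokens (RIS, RVOS)
mask_ref = [[1.0, 0.0]]              # e.g. features pooled from a mask (FSS, VOS)

fused_text = fuse(image_feats, text_ref)
fused_mask = fuse(image_feats, mask_ref)
```

The point of the sketch is that the same `fuse` call accepts either kind of reference, which is one plausible way a single network could switch tasks at run-time by changing only the reference it is given.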

Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 80.8 | 1193 |
| Video Object Segmentation | YouTube-VOS 2018 (val) | J Score (Seen) | 83.8 | 493 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 81.9 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 82.1 | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 68.33 | 252 |
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 67.4 | 244 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen) | 83.1 | 231 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 83.48 | 230 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 74 | 230 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU | 68.4 | 223 |
Showing 10 of 27 rows
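
The benchmark rows report several IoU-style metrics that are easy to confuse. The J score is the Jaccard index (mask IoU) between prediction and ground truth, mIoU averages the per-sample IoU, and cIoU divides the total intersection by the total union over all samples, so it weights large objects more heavily. The sketch below illustrates the difference on toy masks; the function names are hypothetical, and treating two empty masks as a perfect match is an assumption, since each benchmark defines its own edge-case handling.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1


def inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def iou(pred: Mask, gt: Mask) -> float:
    """Jaccard index of one prediction (the per-mask quantity behind J)."""
    inter, union = inter_union(pred, gt)
    # Assumption: two empty masks are scored as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """mIoU: average of the per-sample IoU values."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def cumulative_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """cIoU: total intersection over total union across all samples."""
    total_inter = total_union = 0
    for p, g in zip(preds, gts):
        inter, union = inter_union(p, g)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# A half-correct small object and a perfectly segmented large object.
preds = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]

print(mean_iou(preds, gts))        # (0.5 + 1.0) / 2 = 0.75
print(cumulative_iou(preds, gts))  # (1 + 4) / (2 + 4) = 5/6
```

Because the two metrics aggregate differently, mIoU and cIoU numbers on the same dataset split are not directly comparable.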
