
Deep Instruction Tuning for Segment Anything Model

About

Recently, the Segment Anything Model (SAM) has become a research hotspot in multimedia and computer vision, exhibiting powerful and versatile capabilities on various (un)conditional image segmentation tasks. Although SAM supports different types of segmentation prompts, we note that, compared to point- and box-guided segmentation, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in SAM's default lightweight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM, one end-to-end and the other layer-wise. With minimal modifications, DITs directly transform the image encoder of SAM into a stand-alone vision-language learner, rather than building another deep fusion branch, maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive RIS benchmark datasets show that a simple end-to-end DIT improves SAM by a large margin, while the layer-wise DIT further boosts performance to state-of-the-art with much less data and training expenditure. Our code is released at: https://github.com/wysnzzzz/DIT.

Xiaorui Huang, Gen Luo, Chaoyang Zhu, Bo Tong, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| Referring Image Segmentation | RefCOCO (val) | mIoU 71.98 | 259 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 51.72 | 252 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 74.51 | 230 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 59.97 | 179 |
| Referring Image Segmentation | RefCOCO (test-B) | mIoU 68.77 | 171 |
| Referring Image Segmentation | RefCOCOg (val) | -- | 100 |
| Referring Image Segmentation | RefCOCO+ (testA) | mIoU 65.52 | 97 |
| Referring Image Segmentation | RefCOCOg (test) | -- | 61 |
| Referring Image Segmentation | RefCOCOg (val (U)) | mIoU 60.18 | 54 |
| Referring Image Segmentation | RefCOCOg (test(U)) | mIoU 61.15 | 54 |
Showing 10 of 11 rows
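
The results above are reported as mIoU. As a rough illustration of how such a number can be computed, the sketch below assumes that mIoU means the average of per-sample intersection-over-union between predicted and ground-truth binary masks, scaled to a percentage. The page does not state the exact evaluation protocol, and some RIS work reports a cumulative (overall) IoU instead, so treat this as an assumption. The names `iou` and `mean_iou` are hypothetical helpers and are not taken from the released code.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average per-sample IoU, as a percentage."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need the same non-zero number of predictions and targets")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
    print(mean_iou(preds, targets))
```

In this toy case the first sample has an IoU of 0.5 and the second an IoU of 1.0, so the average is 75.0. Averaging per sample weights every referring expression equally regardless of object size, which is one reason the choice between mean and cumulative IoU matters when comparing numbers across papers.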
