VRP-SAM: SAM with Visual Reference Prompt
About
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to use annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM uses annotated reference images to comprehend specific objects and then segments those objects in a target image. Notably, the VRP encoder supports a variety of annotation formats for reference images, including point, box, scribble, and mask. VRP-SAM extends the versatility and applicability of the SAM framework while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to segment unseen objects and to perform cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM
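To make the four annotation formats concrete, the following is a minimal sketch of how point, box, scribble, and mask annotations on a reference image could each be rasterized into a common binary mask. This is an illustrative assumption, not the VRP encoder itself: the function name `annotation_to_mask`, the `radius` parameter, and the `(x, y)` coordinate convention are all hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def annotation_to_mask(shape, kind, data, radius=2):
    """Rasterize a reference annotation into a boolean (H, W) mask.

    Hypothetical helper for illustration only; not the paper's method.
    shape:  (height, width) of the reference image.
    kind:   one of "point", "box", "scribble", "mask".
    data:   point/scribble -> iterable of (x, y) pixel coordinates,
            box            -> (x0, y0, x1, y1), inclusive corners,
            mask           -> array of shape (height, width).
    radius: disc radius (pixels) drawn around each point or scribble sample.
    """
    h, w = shape
    mask = np.zeros((h, w), dtype=bool)

    if kind == "mask":
        mask = np.asarray(data, dtype=bool).copy()
        if mask.shape != (h, w):
            raise ValueError("mask annotation must match the image shape")
    elif kind == "box":
        x0, y0, x1, y1 = data
        mask[y0:y1 + 1, x0:x1 + 1] = True
    elif kind in ("point", "scribble"):
        # A scribble is treated as a dense sequence of points.
        pts = np.asarray(data, dtype=int).reshape(-1, 2)
        ys, xs = np.mgrid[0:h, 0:w]
        for x, y in pts:
            mask |= (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    else:
        raise ValueError(f"unknown annotation kind: {kind!r}")

    return mask


# Usage: a box and a single point on a small reference image.
box_mask = annotation_to_mask((5, 6), "box", (1, 1, 3, 2))
point_mask = annotation_to_mask((5, 5), "point", [(2, 2)], radius=1)
```

Reducing every format to one mask representation is only one possible design; it keeps downstream code independent of how the reference was annotated, at the cost of discarding the distinction between sparse and dense annotations.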
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 62.1 | 1130 |
| Few-shot Semantic Segmentation | COCO-20i | mIoU | 53.9 | 115 |
| Few-shot Semantic Segmentation | PASCAL-5i | mIoU | 71.9 | 96 |
| Semantic segmentation | COCO-20i (test) | Mean Score | 60.4 | 70 |
| Semantic segmentation | PASCAL 1-shot 5i | mIoU (fold1) | 78.3 | 57 |
| Semantic segmentation | COCO 20i 1-shot | Fold 0 Score | 48.1 | 41 |
| Few-shot Semantic Segmentation | COCO-20i binary | mIoU | 53.9 | 14 |
| Face Segmentation | Authors' Face Occlusion Dataset (test) | Occlusion IoU | 47.4 | 13 |
| Part Segmentation | PASCAL-Part | mIoU | 36.2 | 10 |
| One-shot semantic segmentation | COCO-20i (novel) | F-Score (Fold 0) | 48.1 | 9 |