SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning
About
Leveraging multimodal large models for image segmentation has become a prominent research direction. However, existing approaches typically rely heavily on manually annotated datasets that include explicit reasoning processes, which are costly and time-consuming to produce. Recent advances suggest that reinforcement learning (RL) can endow large models with reasoning capabilities without requiring such reasoning-annotated data. In this paper, we propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. By integrating task-specific, fine-grained rewards with a tailored optimization objective, we further enhance the model's reasoning and segmentation alignment. We also leverage the Segment Anything Model (SAM) as a strong and flexible reward provider to guide the learning process. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks, demonstrating the effectiveness of reinforcement learning in equipping multimodal models with segmentation-oriented reasoning capabilities.
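As a rough illustration of a segmentation-based reward, the sketch below scores a predicted mask against a ground-truth mask by IoU and maps that score to a tiered reward. It is a minimal sketch, not the paper's reward definition: the abstract does not specify the exact form, so the function names `mask_iou` and `tiered_iou_reward` and the thresholds are hypothetical. The predicted mask is assumed to come from SAM, prompted by the model's output. Masks are plain nested lists of 0/1 to keep the example dependency-free.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels; both masks share one shape


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks get an IoU of 0.0 rather than 1.0.
    return inter / union if union else 0.0


def tiered_iou_reward(iou: float, thresholds=(0.5, 0.7, 0.9)) -> float:
    """Hypothetical fine-grained reward: fraction of IoU thresholds reached.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    passed = sum(1 for t in thresholds if iou >= t)
    return passed / len(thresholds)


# Usage: the predicted mask would come from SAM in a real pipeline.
predicted = [[1, 1, 0], [1, 1, 0]]
ground_truth = [[1, 1, 1], [1, 1, 1]]
reward = tiered_iou_reward(mask_iou(predicted, ground_truth))
```

A graded reward of this kind gives the policy a denser signal than a single pass/fail threshold, which is one plausible reading of "fine-grained rewards" here.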
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | -- | 217 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 190 |
| Reasoning Segmentation | ReasonSeg (val) | cIoU 55.8 | 145 |
| Reasoning Segmentation | ReasonSeg (test) | gIoU 60.2 | 102 |
| Referring Segmentation | RefCOCO (val) | cIoU 79.2 | 51 |
| Referring Expression Segmentation | RefCOCOg UMD (test) | mIoU 73.1 | 13 |
| Socio-name Segmentation | SocioSeg (test) | cIoU 25.6 | 10 |
| Socio-semantic Segmentation | SocioSeg (test) | cIoU 22.5 | 10 |
| Socio-semantic Segmentation | SocioSeg OOD (New Region) | cIoU 0.148 | 10 |
| Socio-class Segmentation | SocioSeg (test) | cIoU 22.3 | 10 |
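
The table mixes gIoU and cIoU. As these metrics are commonly defined for reasoning segmentation, gIoU averages the per-image IoU, while cIoU divides the intersection summed over all images by the union summed over all images. The sketch below shows that difference under this assumption. The helper names are hypothetical, and counting an image with an empty union as a perfect score is an assumed convention that varies between implementations.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou_and_ciou(pairs: Iterable[Tuple[Mask, Mask]]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over (prediction, ground truth) mask pairs."""
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: an image with an empty union counts as IoU 1.0.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image) if per_image else 0.0
    ciou = total_inter / total_union if total_union else 0.0
    return giou, ciou


# Usage: a small object segmented poorly and a large one segmented perfectly.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),
]
giou, ciou = giou_and_ciou(pairs)
```

Because cIoU pools pixels across images, it weights large objects more heavily than gIoU does, so the two numbers are not directly comparable across rows.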