RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
About
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, which limits their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark, RegionDial-Bench, with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions and aligns them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards that combine grounding fidelity and global-local semantic alignment. Experiments on the detection and segmentation tasks of RegionDial-Bench show that RegionReasoner-7B considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
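The abstract describes the reward design only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes bounding boxes are cited in the trace as `[x1, y1, x2, y2]` integer lists, approximates "key objects and nouns" with a simple stop-word filter instead of a real noun extractor, and combines the two terms with equal weights. All function names, the box format, the stop-word list, and the weights are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would more likely use a POS tagger
# or noun-phrase extractor to pick out objects and nouns.
STOPWORDS = {"the", "and", "with", "that", "this", "are", "was", "for", "sits"}

# Assumed citation format: boxes written in the trace as [x1, y1, x2, y2].
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def keywords(text):
    """Crude stand-in for key object/noun extraction."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


def citation_reward(trace, reference_boxes):
    """Fraction of reference boxes that the trace cites verbatim."""
    if not reference_boxes:
        return 0.0
    cited = {tuple(int(v) for v in m) for m in BOX_PATTERN.findall(trace)}
    hits = sum(1 for box in reference_boxes if tuple(box) in cited)
    return hits / len(reference_boxes)


def consistency_reward(trace, global_caption, region_captions):
    """Average keyword coverage of the global and region-level captions."""
    trace_kw = keywords(trace)

    def coverage(caption):
        kw = keywords(caption)
        return len(kw & trace_kw) / len(kw) if kw else 0.0

    global_score = coverage(global_caption)
    if region_captions:
        local_score = sum(coverage(c) for c in region_captions) / len(region_captions)
    else:
        local_score = 0.0
    return 0.5 * (global_score + local_score)


def total_reward(trace, reference_boxes, global_caption, region_captions,
                 w_ground=0.5, w_consist=0.5):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return (w_ground * citation_reward(trace, reference_boxes)
            + w_consist * consistency_reward(trace, global_caption, region_captions))


if __name__ == "__main__":
    trace = "The dog at [10, 20, 110, 220] sits on the sofa."
    print(total_reward(trace, [(10, 20, 110, 220)],
                       "a dog on a sofa", ["a brown dog"]))
```

In this toy form, a trace that omits the reference box loses the entire grounding term, and a trace that ignores objects named in the captions loses part of the consistency term, which mirrors the two pressures the abstract describes.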
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Comprehension | RefCOCOg (test) | -- | 291 |
| Referring Expression Segmentation | RefCOCOg (test) | -- | 78 |
| Referring Segmentation | RegionDial-Bench RefCOCOg Multi-turn 1.0 (test) | R1 gIoU: 73.9 | 6 |
| Referring Segmentation | RegionDial-Bench RefCOCO+ Multi-turn 1.0 (test) | R1 (gIoU): 76.4 | 6 |
| Referring Detection | RegionDial-Bench RefCOCOg Multi-turn (test) | R1: 87.1 | 5 |
| Visual Search | V* Benchmark | Attribute Success Rate: 75.65 | 5 |
| Referring Detection | RegionDial-Bench RefCOCO+ Multi-turn (test) | R1: 89.3 | 5 |
| Referring Expression Segmentation | RefCOCO+ (test) | gIoU: 76.9 | 4 |
| Referring Expression Comprehension | RefCOCO+ (test) | -- | 4 |