RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

About

Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
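The abstract describes the reward only at a high level, so the following is a minimal sketch of how a grounding term and a global-local consistency term might be combined, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the `[x1, y1, x2, y2]` citation syntax, the exact-match citation check, the stopword filter standing in for noun extraction, and the equal weights.

```python
import re

# Hypothetical stopword list; a crude stand-in for real noun/object extraction.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "and", "is", "with", "to", "at", "next"}

# Assumed citation syntax: boxes written as [x1, y1, x2, y2] inside the trace.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def cited_boxes(trace):
    """Return the set of integer boxes that appear in the reasoning trace."""
    return {tuple(int(v) for v in match) for match in BOX_PATTERN.findall(trace)}


def grounding_reward(trace, reference_boxes):
    """Fraction of reference boxes that the trace cites verbatim (assumed rule)."""
    if not reference_boxes:
        return 1.0
    cited = cited_boxes(trace)
    hits = sum(1 for box in reference_boxes if tuple(box) in cited)
    return hits / len(reference_boxes)


def keywords(text):
    """Lowercased alphabetic tokens minus stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def consistency_reward(trace, global_caption, region_captions):
    """Fraction of caption keywords (global plus regions) found in the trace."""
    expected = keywords(global_caption)
    for caption in region_captions:
        expected |= keywords(caption)
    if not expected:
        return 1.0
    return len(expected & keywords(trace)) / len(expected)


def total_reward(trace, reference_boxes, global_caption, region_captions,
                 w_ground=0.5, w_consist=0.5):
    """Weighted sum of the two terms; the weights are illustrative only."""
    return (w_ground * grounding_reward(trace, reference_boxes)
            + w_consist * consistency_reward(trace, global_caption, region_captions))


# Made-up example inputs, not taken from the benchmark.
example_trace = ("The dog in [10, 20, 110, 220] sits next to "
                 "the red ball in [150, 180, 200, 230].")
example_boxes = [(10, 20, 110, 220), (150, 180, 200, 230)]
example_global = "A dog and a ball on grass"
example_regions = ["a brown dog", "a red ball"]

print(total_reward(example_trace, example_boxes, example_global, example_regions))
```

In this toy setup a trace that cites every reference box but omits some caption keywords (here "grass" and "brown") scores full marks on grounding and partial marks on consistency. A real system would likely match boxes by IoU rather than exact coordinates and use a tagger or parser to pick out nouns.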

Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek • 2026

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Comprehension | RefCOCOg (test) | -- | 291
Referring Expression Segmentation | RefCOCOg (test) | -- | 78
Referring Segmentation | RegionDial-Bench RefCOCOg Multi-turn 1.0 (test) | R1 gIoU 73.9 | 6
Referring Segmentation | RegionDial-Bench RefCOCO+ Multi-turn 1.0 (test) | R1 (gIoU) 76.4 | 6
Referring Detection | RegionDial-Bench RefCOCOg Multi-turn (test) | R1 87.1 | 5
Visual Search | V* Benchmark | Attribute Success Rate 75.65 | 5
Referring Detection | RegionDial-Bench RefCOCO+ Multi-turn (test) | R1 89.3 | 5
Referring Expression Segmentation | RefCOCO+ (test) | gIoU 76.9 | 4
Referring Expression Comprehension | RefCOCO+ (test) | -- | 4
