Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
About
Most referring object detection (ROD) models, especially modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors must learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce the Data-efficient Referring Object Detection (De-ROD) task, a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors improve label efficiency and convergence. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors offers a practical and extensible path toward data-efficient vision-language understanding.
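To make the idea of a heuristic-inspired prior concrete, here is a minimal sketch of how a spatial cue parsed from the referring phrase could bias proposal ranking, the first of the three stages above. All names (`spatial_prior`, `rerank_proposals`, the keyword list, the blending weight) are hypothetical illustrations, not the actual HeROD implementation, whose priors are not detailed in this summary.

```python
import re

def spatial_prior(phrase, box, img_w, img_h):
    """Score in [0, 1] for how well a box matches a simple spatial cue
    ("left", "right", "top", "bottom") in the referring phrase.
    Hypothetical heuristic, used only to illustrate the concept."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / img_w  # normalized box center
    cy = (y0 + y1) / 2 / img_h
    text = phrase.lower()
    if re.search(r"\bleft\b", text):
        return 1.0 - cx   # prefer boxes toward the left edge
    if re.search(r"\bright\b", text):
        return cx
    if re.search(r"\btop\b", text):
        return 1.0 - cy
    if re.search(r"\bbottom\b", text):
        return cy
    return 0.5  # neutral score when no cue fires

def rerank_proposals(phrase, proposals, img_w, img_h, weight=0.3):
    """Blend detector confidence with the heuristic prior and re-sort.
    `proposals` is a list of (confidence, (x0, y0, x1, y1)) pairs."""
    scored = [
        ((1 - weight) * conf
         + weight * spatial_prior(phrase, box, img_w, img_h), box)
        for conf, box in proposals
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```

The same interpretable score could, in principle, also be folded into the fusion and Hungarian-matching costs, which is the model-agnostic appeal of such priors.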
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Image Segmentation | RefCOCO (val) | mIoU 78.21 | 259 |
| Referring Image Segmentation | RefCOCO+ (testB) | mIoU 63.86 | 252 |
| Referring Image Segmentation | RefCOCO (testA) | mIoU 80.16 | 230 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 71.52 | 179 |
| Referring Image Segmentation | RefCOCO (testB) | mIoU 74.3 | 171 |
| Referring Image Segmentation | RefCOCOg (val) | -- | 100 |
| Referring Image Segmentation | RefCOCO+ (testA) | mIoU 76.86 | 97 |
| Referring Image Segmentation | RefCOCOg (test) | -- | 61 |
| Referring Object Detection | RefCOCO (val) | Top-1 Accuracy 89.57 | 28 |
| Referring Object Detection | RefCOCO (testB) | Top-1 Accuracy 86.06 | 28 |