Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
About
Most referring object detection (ROD) models, especially modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors must learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce the Data-efficient Referring Object Detection (De-ROD) task, a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors improve label efficiency and convergence. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors offers a practical and extensible path toward data-efficient vision-language understanding.
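To make the idea of a heuristic-inspired prior concrete, here is a minimal sketch of how a spatial cue parsed from the referring phrase could bias proposal ranking, the first of the three stages above. All names (`spatial_prior`, `rerank_proposals`, the keyword list, the blending weight) are hypothetical illustrations, not the actual HeROD implementation, whose priors are not detailed in this summary.

```python
import re

def spatial_prior(phrase, box, img_w, img_h):
    """Score in [0, 1] for how well a box matches a simple spatial cue
    ("left", "right", "top", "bottom") in the referring phrase.
    Hypothetical heuristic, used only to illustrate the concept."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / img_w  # normalized box center
    cy = (y0 + y1) / 2 / img_h
    text = phrase.lower()
    if re.search(r"\bleft\b", text):
        return 1.0 - cx   # prefer boxes toward the left edge
    if re.search(r"\bright\b", text):
        return cx
    if re.search(r"\btop\b", text):
        return 1.0 - cy
    if re.search(r"\bbottom\b", text):
        return cy
    return 0.5  # neutral score when no cue fires

def rerank_proposals(phrase, proposals, img_w, img_h, weight=0.3):
    """Blend detector confidence with the heuristic prior and re-sort.
    `proposals` is a list of (confidence, (x0, y0, x1, y1)) pairs."""
    scored = [
        ((1 - weight) * conf
         + weight * spatial_prior(phrase, box, img_w, img_h), box)
        for conf, box in proposals
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```

The same interpretable score could, in principle, also be folded into the fusion and Hungarian-matching costs, which is the model-agnostic appeal of such priors.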
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Image Segmentation | RefCOCO (val) | mIoU 78.21 | 259 |
| Referring Image Segmentation | RefCOCO+ (testB) | mIoU 63.86 | 252 |
| Referring Image Segmentation | RefCOCO (testA) | mIoU 80.16 | 230 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 71.52 | 179 |
| Referring Image Segmentation | RefCOCO (testB) | mIoU 74.3 | 171 |
| Referring Image Segmentation | RefCOCOg (val) | -- | 100 |
| Referring Image Segmentation | RefCOCO+ (testA) | mIoU 76.86 | 97 |
| Referring Image Segmentation | RefCOCOg (test) | -- | 61 |
| Referring Object Detection | RefCOCO (val) | Top-1 Accuracy 89.57 | 28 |
| Referring Object Detection | RefCOCO (testB) | Top-1 Accuracy 86.06 | 28 |