Described Object Detection: Liberating Object Detection with Flexible Expressions

About

Detecting objects based on language information is a popular task that includes Open-Vocabulary Object Detection (OVD) and Referring Expression Comprehension (REC). In this paper, we advance them to a more practical setting called Described Object Detection (DOD) by expanding category names to flexible language expressions for OVD and by overcoming REC's limitation of only grounding objects that are assumed to be present. We establish the research foundation for DOD by constructing a Description Detection Dataset ($D^3$). This dataset features flexible language expressions, whether short category names or long descriptions, and annotates all described objects on all images without omission. By evaluating previous SOTA methods on $D^3$, we identify several challenges on which current REC, OVD, and bi-functional methods fail. REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios, while OVD methods are constrained by long and complex descriptions. Recent bi-functional methods also do not work well on DOD because their training procedures and inference strategies for the REC and OVD tasks are separate. Building on these findings, we propose a baseline that substantially improves REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming existing methods. Data and code are available at https://github.com/shikras/d-cube and related works are tracked at https://github.com/Charles-Xie/awesome-described-object-detection.

Chi Xie, Zhao Zhang, Yixuan Wu, Feng Zhu, Rui Zhao, Shuang Liang • 2023
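
The difference between the REC and DOD output conventions described in the abstract can be illustrated with a minimal sketch. This is not the paper's baseline or its code: the `Detection` class, the `rec_style` and `dod_style` functions, and the 0.5 threshold are hypothetical names and values chosen only for illustration. It assumes a model has already produced candidate boxes with confidence scores for one description on one image.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x1, y1, x2, y2) in pixels; a hypothetical box format for this sketch
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    box: Box
    score: float  # confidence that the box matches the description


def rec_style(candidates: List[Detection]) -> List[Detection]:
    """REC-style output: return the single top-scoring box.

    The described object is assumed to exist, so a box is returned
    even when every score is low.
    """
    if not candidates:
        return []
    return [max(candidates, key=lambda d: d.score)]


def dod_style(candidates: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """DOD-style output: keep every box whose score reaches the threshold.

    The result may be empty (description absent from the image) or
    contain several boxes (multiple described objects).
    The threshold value is an illustrative assumption.
    """
    return [d for d in candidates if d.score >= threshold]


# Image where the description matches nothing: all scores are low.
negative = [Detection((0, 0, 10, 10), 0.2), Detection((5, 5, 20, 20), 0.1)]

# Image where the description matches two objects.
multi = [
    Detection((0, 0, 10, 10), 0.9),
    Detection((30, 30, 50, 50), 0.8),
    Detection((5, 5, 20, 20), 0.1),
]
```

On the `negative` example, `rec_style` still returns one box while `dod_style` returns none; on `multi`, `rec_style` returns one box while `dod_style` returns two. This mirrors the negative-instance and multi-target difficulties the abstract attributes to REC methods.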

Related benchmarks

Task	Dataset	Metric	Result
Object Detection	D3	Full Score	21.6	35
Dynamic Object Detection	D³ (Full)	Intra-scenario mAP	21.6	20
Referring Expression Grounding	D^3 Intra-scenario	Full Success Rate	21.6	20
Diverse Object Detection	D3 Intra-scenario	mAP (FULL)	21.6	10
Diverse Object Detection	D3 (Inter-scenario)	mAP (FULL)	5.7	10
Dynamic Object Detection	D³ (Present)	mAP (Intra-scenario)	23.7	10
Visual Grounding	D3 Intra-scenario	APb (Full)	21.6	10
Visual Grounding	D3 (Inter-scenario)	APb (Full)	570	10
