DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

About

We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive approach to denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, a significant improvement of +6.0 AP and +2.7 AP, respectively, over DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP). Compared to other models on the leaderboard, DINO uses a significantly smaller model and less pre-training data while achieving better results. Our code will be available at https://github.com/IDEACVR/DINO.

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum • 2022

Related benchmarks

Task	Dataset	Result
Semantic Segmentation	ADE20K (val)	mIoU 59.5	3069
Object Detection	COCO 2017 (val)	AP 63.2	2843
Object Detection	COCO (test-dev)	mAP 65.5	1239
Object Detection	MS COCO (test-dev)	--	677
Object Detection	COCO (val)	mAP 58.5	637
Object Detection	LVIS v1.0 (val)	APbbox 28.8	542
Object Detection	COCO v2017 (test-dev)	mAP 63.3	499
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy 82.75	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy 90.56	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy 0.9319	346

Showing 10 of 126 rows
