YOLOv12: Attention-Centric Real-Time Object Detectors

About

Enhancing the network architecture of the YOLO framework has long been crucial, but work has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capability. This is because attention-based models have not matched the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, YOLOv12, that matches the speed of previous CNN-based frameworks while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy at competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLOv11-N by 2.1% / 1.2% mAP at comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve on DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
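The margins quoted in the abstract can be turned back into the baseline scores they imply. The sketch below is only an arithmetic illustration and assumes the 2.1% / 1.2% gains are absolute percentage points of mAP rather than relative improvements. The helper name `implied_baseline_map` is hypothetical, and the derived numbers are not stated in the abstract itself.

```python
# Figures quoted in the abstract (mAP in percent).
YOLOV12_N_MAP = 40.6
# Stated mAP gains of YOLOv12-N over each baseline.
# Assumption: these are absolute percentage points, not relative gains.
MARGINS = {"YOLOv10-N": 2.1, "YOLOv11-N": 1.2}


def implied_baseline_map(new_map: float, margin: float) -> float:
    """Return the baseline mAP implied by an absolute margin (hypothetical helper)."""
    # Round to one decimal to match the precision used in the abstract.
    return round(new_map - margin, 1)


implied = {
    name: implied_baseline_map(YOLOV12_N_MAP, margin)
    for name, margin in MARGINS.items()
}

for name, value in implied.items():
    print(f"{name}: implied mAP of about {value}%")
```

Under that assumption the baselines land at roughly 38.5% and 39.4% mAP. Check these against the paper's own tables before citing them.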

Yunjie Tian, Qixiang Ye, David Doermann • 2025

Related benchmarks

Task	Dataset	Result
Object Detection	COCO 2017 (val)	AP 55.7	2843
Instance Segmentation	COCO 2017 (val)	--	1275
Image Classification	ImageNet (val)	Top-1 Accuracy 71.7	354
Object Detection	MS-COCO (val)	mAP 0.526	222
Object Detection	MS-COCO	--	208
Object Detection	VisDrone (val)	AP50 45.7	114
Object Detection	PASCAL VOC 2007+2012 (test)	mAP (mean Average Precision) 60.7	95
Object Detection	AI-TOD (test)	AP@0.5 44.7	93
Object Detection	HRIPCB	mAP50 97.1	84
Object Detection	DeepPCB 1.0 (test)	F1 Score 96.1	66

Showing 10 of 79 rows
