
YOLO-World: Real-Time Open-Vocabulary Object Detection

About

The You Only Look Once (YOLO) series of detectors has established itself as an efficient and practical tool. However, its reliance on predefined and trained object categories limits its applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan • 2024
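
The abstract mentions a region-text contrastive loss and zero-shot detection from a vocabulary. As a rough illustration of the general idea only (not the authors' implementation), the sketch below scores region embeddings against text embeddings with cosine similarity, applies a softmax cross-entropy over the vocabulary, and picks the best-matching word per region. All function names, the toy embeddings, and the temperature value of 0.07 are assumptions made for this example.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(regions, texts):
    # Cosine similarity between every region embedding and every text embedding.
    regions = [l2_normalize(r) for r in regions]
    texts = [l2_normalize(t) for t in texts]
    return [[sum(a * b for a, b in zip(r, t)) for t in texts] for r in regions]

def region_text_contrastive_loss(regions, texts, targets, temperature=0.07):
    # Mean softmax cross-entropy: each region should match its assigned text.
    # The temperature is an illustrative assumption, not a value from the paper.
    sims = similarity_matrix(regions, texts)
    total = 0.0
    for row, target in zip(sims, targets):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[target]
    return total / len(targets)

def zero_shot_labels(regions, texts, vocabulary):
    # Assign each region the vocabulary entry with the highest similarity.
    sims = similarity_matrix(regions, texts)
    return [vocabulary[max(range(len(row)), key=row.__getitem__)] for row in sims]

# Toy 2-D embeddings (hypothetical; real models use learned, high-dimensional ones).
vocabulary = ["dog", "bicycle"]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
region_emb = [[0.9, 0.1], [0.2, 0.8]]

labels = zero_shot_labels(region_emb, text_emb, vocabulary)
loss_matched = region_text_contrastive_loss(region_emb, text_emb, [0, 1])
loss_swapped = region_text_contrastive_loss(region_emb, text_emb, [1, 0])
```

With these toy inputs, `labels` is `["dog", "bicycle"]`, and the loss for the matching assignment is lower than for the swapped one. Because the vocabulary enters only through the text embeddings, changing the word list changes what can be detected without retraining a fixed classifier head, which is the sense in which such a detector is open-vocabulary.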

Related benchmarks

Task              Dataset                        Result            Rank
Object Detection  COCO 2017 (val)                AP 53.3           2843
Object Detection  COCO (val)                     --                637
Object Detection  COCO 2017                      AP (Box) 53.3     345
Object Detection  COCO                           AP50 (Box) 59.8   237
Object Detection  LVIS (val)                     mAP 33.3          170
Object Detection  LVIS (minival)                 AP 35.4           159
Object Detection  Cityscapes                     mAP 24.1          136
Object Detection  LVIS mini (val)                mAP 35.4          120
Object Detection  ODinW-13                       AP 38.4           98
Object Detection  Pascal VOC -> Clipart (test)   mAP 46.2          91
Showing 10 of 66 rows
