VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors
About
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO exhibit strong zero-shot generalization, but their performance degrades under distribution shift. Test-time adaptation (TTA) offers a practical way to adapt models online using only unlabeled target data. However, despite substantial progress in TTA for vision-language classification, TTA for VLODs remains largely unexplored, and the only prior method relies on a mean-teacher framework that introduces significant latency and memory overhead. To this end, we introduce VLOD-TTA, a TTA method that leverages dense proposal overlap and image-conditioned prompts to adapt VLODs with low additional overhead. VLOD-TTA combines (i) an IoU-weighted entropy objective that emphasizes spatially coherent proposal clusters and mitigates confirmation bias from isolated boxes, and (ii) image-conditioned prompt selection, which ranks prompts by image-level compatibility and aggregates the most informative prompt scores for detection. Our experiments across diverse distribution shifts, including artistic domains, adverse driving conditions, low-light imagery, and common corruptions, show that VLOD-TTA consistently outperforms standard TTA baselines and the prior state of the art on both YOLO-World and Grounding DINO. Code: https://github.com/imatif17/VLOD-TTA
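The two components above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the repository's implementation): `iou_weighted_entropy` down-weights the entropy of proposals that have little IoU overlap with the rest, so isolated boxes contribute less to the adaptation loss, and `select_prompts` stands in for image-conditioned prompt selection by ranking prompt embeddings against an image embedding with cosine similarity and keeping the top-k. Function names, the box format `(x1, y1, x2, y2)`, and the cosine-similarity compatibility score are all assumptions for illustration.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix for boxes given as (x1, y1, x2, y2) rows."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def iou_weighted_entropy(boxes, probs):
    """Per-proposal entropy, weighted by each proposal's total IoU
    overlap with the other proposals (illustrative sketch: an isolated
    box with zero overlap gets zero weight)."""
    ent = -(probs * np.log(probs + 1e-9)).sum(axis=1)  # per-proposal entropy
    iou = pairwise_iou(boxes)
    np.fill_diagonal(iou, 0.0)                         # ignore self-overlap
    w = iou.sum(axis=1)                                # spatial "support"
    w = w / np.maximum(w.sum(), 1e-9)                  # normalize weights
    return (w * ent).sum()

def select_prompts(image_emb, prompt_embs, k=2):
    """Hypothetical image-conditioned prompt selection: rank prompts by
    cosine similarity to the image embedding, return top-k indices."""
    im = image_emb / np.linalg.norm(image_emb)
    pe = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return np.argsort(pe @ im)[::-1][:k]
```

With two heavily overlapping confident boxes and one isolated uncertain box, the weighted loss is dominated by the coherent cluster, whereas a plain mean entropy would be inflated by the isolated box.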
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | ExDark | mAP | 37.3 | 58 |
| Object Detection | Cityscapes | AP50 | 40.8 | 32 |
| Object Detection | BDD100K | mAP | 18.1 | 31 |
| Object Detection | COCO-C | mAP | 27.3 | 26 |
| Object Detection | Comic | AP50 | 67.1 | 18 |
| Object Detection | Watercolor | AP50 | 72.8 | 18 |
| Object Detection | PASCAL C | mAP (Gaussian) | 15.3 | 18 |
| Object Detection | Aquarium | -- | -- | 4 |