
VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

About

Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO exhibit strong zero-shot generalization, but their performance degrades under distribution shift. Test-time adaptation (TTA) offers a practical way to adapt models online using only unlabeled target data. However, despite substantial progress in TTA for vision-language classification, TTA for VLODs remains largely unexplored: the only prior method relies on a mean-teacher framework that introduces significant latency and memory overhead. To address this, we introduce VLOD-TTA, a TTA method that leverages dense proposal overlap and image-conditioned prompts to adapt VLODs with low additional overhead. VLOD-TTA combines (i) an IoU-weighted entropy objective that emphasizes spatially coherent proposal clusters and mitigates confirmation bias from isolated boxes, and (ii) image-conditioned prompt selection that ranks prompts by image-level compatibility and aggregates the most informative prompt scores for detection. Our experiments across diverse distribution shifts, including artistic domains, adverse driving conditions, low-light imagery, and common corruptions, show that VLOD-TTA consistently outperforms standard TTA baselines and the prior state-of-the-art method on both YOLO-World and Grounding DINO. Code: https://github.com/imatif17/VLOD-TTA
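The two components described above can be illustrated with a minimal PyTorch sketch. The function names, the overlap-mass weighting, and the top-k softmax aggregation here are plausible assumptions for exposition, not the authors' exact formulation:

```python
import torch


def box_iou(boxes):
    # boxes: (N, 4) in (x1, y1, x2, y2); returns (N, N) pairwise IoU
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])  # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area[:, None] + area[None, :] - inter + 1e-6)


def iou_weighted_entropy(boxes, logits):
    # boxes: (N, 4) proposals; logits: (N, C) per-proposal class scores.
    # Each proposal's entropy is weighted by its overlap mass with the
    # other proposals, so spatially coherent clusters dominate the loss
    # and isolated (likely noisy) boxes contribute little.
    probs = logits.softmax(dim=-1)
    ent = -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)  # (N,)
    iou = box_iou(boxes)
    w = iou.sum(dim=-1) - 1.0          # drop the self-IoU of 1
    w = w / (w.sum() + 1e-6)           # normalize weights
    return (w * ent).sum()


def select_prompts(image_emb, prompt_embs, k=3):
    # image_emb: (D,), prompt_embs: (P, D).
    # Rank prompts by image-level compatibility (cosine similarity) and
    # return the top-k indices with softmax aggregation weights.
    sims = torch.nn.functional.cosine_similarity(
        prompt_embs, image_emb[None, :], dim=-1)
    topk = sims.topk(min(k, sims.numel())).indices
    return topk, sims[topk].softmax(dim=-1)
```

In a TTA loop, the entropy term would be minimized online over a few adapted parameters, while the selected prompt scores are aggregated with the returned weights before detection.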

Atif Belal, Heitor R. Medeiros, Marco Pedersoli, Eric Granger• 2025

Related benchmarks

Task               Dataset      Metric           Result   Rank
Object Detection   ExDark       mAP              37.3     58
Object Detection   Cityscapes   AP50             40.8     32
Object Detection   BDD100K      mAP              18.1     31
Object Detection   COCO-C       mAP              27.3     26
Object Detection   Comic        AP50             67.1     18
Object Detection   Watercolor   AP50             72.8     18
Object Detection   PASCAL C     mAP (Gaussian)   15.3     18
Object Detection   Aquarium     --               --       4
