VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors
About
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO exhibit strong zero-shot generalization, but their performance degrades under distribution shift. Test-time adaptation (TTA) offers a practical way to adapt models online using only unlabeled target data. However, despite substantial progress in TTA for vision-language classification, TTA for VLODs remains largely unexplored, and the only prior method relies on a mean-teacher framework that introduces significant latency and memory overhead. To this end, we introduce VLOD-TTA, a TTA method that leverages dense proposal overlap and image-conditioned prompts to adapt VLODs with low additional overhead. VLOD-TTA combines (i) an IoU-weighted entropy objective that emphasizes spatially coherent proposal clusters and mitigates confirmation bias from isolated boxes, and (ii) image-conditioned prompt selection, which ranks prompts by image-level compatibility and aggregates the most informative prompt scores for detection. Our experiments across diverse distribution shifts, including artistic domains, adverse driving conditions, low-light imagery, and common corruptions, show that VLOD-TTA consistently outperforms standard TTA baselines and the prior state of the art on both YOLO-World and Grounding DINO. Code: https://github.com/imatif17/VLOD-TTA
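The two components above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the repository's implementation): `iou_weighted_entropy` down-weights the entropy of proposals that have little IoU overlap with the rest, so isolated boxes contribute less to the adaptation loss, and `select_prompts` stands in for image-conditioned prompt selection by ranking prompt embeddings against an image embedding with cosine similarity and keeping the top-k. Function names, the box format `(x1, y1, x2, y2)`, and the cosine-similarity compatibility score are all assumptions for illustration.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix for boxes given as (x1, y1, x2, y2) rows."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def iou_weighted_entropy(boxes, probs):
    """Per-proposal entropy, weighted by each proposal's total IoU
    overlap with the other proposals (illustrative sketch: an isolated
    box with zero overlap gets zero weight)."""
    ent = -(probs * np.log(probs + 1e-9)).sum(axis=1)  # per-proposal entropy
    iou = pairwise_iou(boxes)
    np.fill_diagonal(iou, 0.0)                         # ignore self-overlap
    w = iou.sum(axis=1)                                # spatial "support"
    w = w / np.maximum(w.sum(), 1e-9)                  # normalize weights
    return (w * ent).sum()

def select_prompts(image_emb, prompt_embs, k=2):
    """Hypothetical image-conditioned prompt selection: rank prompts by
    cosine similarity to the image embedding, return top-k indices."""
    im = image_emb / np.linalg.norm(image_emb)
    pe = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return np.argsort(pe @ im)[::-1][:k]
```

With two heavily overlapping confident boxes and one isolated uncertain box, the weighted loss is dominated by the coherent cluster, whereas a plain mean entropy would be inflated by the isolated box.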
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | ExDark | mAP | 37.3 | 58 |
| Object Detection | Cityscapes | AP50 | 40.8 | 32 |
| Object Detection | BDD100K | mAP | 18.1 | 31 |
| Object Detection | COCO-C | mAP | 27.3 | 26 |
| Object Detection | Comic | AP50 | 67.1 | 18 |
| Object Detection | Watercolor | AP50 | 72.8 | 18 |
| Object Detection | PASCAL C | mAP (Gaussian) | 15.3 | 18 |
| Object Detection | Aquarium | -- | -- | 4 |