DOTA: Distributional Test-Time Adaptation of Vision-Language Models

About

Vision-language foundation models (VLMs), such as CLIP, exhibit remarkable performance across a wide range of tasks. However, deploying these models can be unreliable when significant distribution gaps exist between training and test data, while fine-tuning for diverse scenarios is often costly. Cache-based test-time adapters offer an efficient alternative by storing representative test samples to guide subsequent classifications. Yet, these methods typically employ naive cache management with limited capacity, leading to severe catastrophic forgetting when samples are inevitably dropped during updates. In this paper, we propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. Crucially, instead of merely memorizing individual test samples, DOTA continuously estimates the underlying distribution of the test data stream. Test-time posterior probabilities are then computed using these dynamically estimated distributions via Bayes' theorem for adaptation. This distribution-centric approach enables the model to continually learn and adapt to the deployment environment. Extensive experiments validate that DOTA significantly mitigates forgetting and achieves state-of-the-art performance compared to existing methods.
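The distribution-centric idea described above can be illustrated with a small sketch. The abstract does not state which distribution family is estimated or how test samples are assigned to classes, so everything below is an assumption made for illustration: each class is modelled as a diagonal Gaussian, the soft labels stand in for zero-shot class probabilities, and the class name `StreamingGaussianClassifier` and its parameters are hypothetical. It is not the authors' implementation.

```python
import math


class StreamingGaussianClassifier:
    """Hypothetical sketch: per-class diagonal Gaussians updated one sample at a time."""

    def __init__(self, num_classes, dim, prior_var=1.0, prior_strength=1.0):
        self.k = num_classes
        self.d = dim
        self.weight = [0.0] * num_classes                     # accumulated soft counts
        self.mean = [[0.0] * dim for _ in range(num_classes)]
        self.m2 = [[0.0] * dim for _ in range(num_classes)]   # weighted squared deviations
        self.prior_var = prior_var                            # assumed variance before any data
        self.prior_strength = prior_strength                  # pseudo-count for that variance

    def update(self, x, soft_labels):
        """Fold one feature vector into every class, weighted by its soft label."""
        for c, w in enumerate(soft_labels):
            if w <= 0.0:
                continue
            self.weight[c] += w
            for j in range(self.d):
                delta = x[j] - self.mean[c][j]
                self.mean[c][j] += (w / self.weight[c]) * delta
                self.m2[c][j] += w * delta * (x[j] - self.mean[c][j])

    def _variance(self, c, j):
        # Regularised variance so that classes with little data stay well defined.
        num = self.m2[c][j] + self.prior_strength * self.prior_var
        return num / (self.weight[c] + self.prior_strength)

    def posterior(self, x):
        """Bayes' theorem: p(c | x) is proportional to p(c) * p(x | c)."""
        total = sum(self.weight) + self.k                     # add-one smoothing on priors
        log_joint = []
        for c in range(self.k):
            lp = math.log((self.weight[c] + 1.0) / total)
            for j in range(self.d):
                var = self._variance(c, j)
                lp -= 0.5 * (math.log(2 * math.pi * var)
                             + (x[j] - self.mean[c][j]) ** 2 / var)
            log_joint.append(lp)
        m = max(log_joint)                                    # stabilise the softmax
        exps = [math.exp(v - m) for v in log_joint]
        z = sum(exps)
        return [e / z for e in exps]


# Toy stream: (feature vector, assumed zero-shot soft labels).
clf = StreamingGaussianClassifier(num_classes=2, dim=2)
stream = [
    ([1.0, 1.0], [0.9, 0.1]),
    ([1.2, 0.8], [0.8, 0.2]),
    ([-1.0, -1.0], [0.1, 0.9]),
    ([-0.9, -1.1], [0.2, 0.8]),
]
for features, soft in stream:
    clf.update(features, soft)

probs = clf.posterior([1.1, 0.9])
```

Because only running statistics are stored, memory stays constant however long the stream runs, and no earlier sample has to be evicted. That is the contrast with a fixed-capacity cache that the abstract draws.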

Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, Changqing Zhang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Fine-grained visual classification | FGVC-Aircraft (test) | Top-1 Acc 25.59 | 312 |
| Fine grained classification | EuroSAT | Accuracy 47.15 | 81 |
| Fine-grained Visual Categorization | FGVCAircraft | Accuracy 18.06 | 74 |
| Fine grained classification | UCF101 | Accuracy 65.08 | 53 |
| Few-shot classification | CIFAR FS (test) | -- | 51 |
| Fine grained classification | Stanford Cars | Accuracy 58.72 | 50 |
| Fine grained classification | Food101 | Top-1 Acc 78.61 | 42 |
| Fine grained classification | SUN397 | Top-1 Accuracy 63.89 | 39 |
| Image Classification | ImageNet A, V, R, S (val) | ImageNet Accuracy 70.68 | 38 |
| Fine grained classification | Oxford Flowers 102 | Accuracy 68.53 | 31 |
Showing 10 of 13 rows
