Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
About
Foundational vision-language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impact of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We find that, although the goal is to capture task-relevant information for downstream tasks accurately, task-irrelevant knowledge still influences the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate the interference of task-irrelevant knowledge. Specifically, we decouple the semantics contained in the data of downstream tasks and perform classification based on each semantic. Furthermore, we employ the Dempster-Shafer evidence theory to evaluate the uncertainty of each prediction generated by the diverse semantics. Experiments across multiple settings consistently demonstrate the effectiveness of CDC.
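The abstract does not spell out how Dempster-Shafer evidence theory is applied, so the following is only a minimal sketch of one common formulation, not the CDC implementation. It assumes each semantic-specific classifier emits non-negative evidence per class, converts that evidence into belief masses plus an uncertainty mass (the subjective-logic form used in evidential deep learning), and fuses the resulting opinions with the reduced Dempster rule for singleton classes. The function names and the example evidence values are hypothetical.

```python
import numpy as np


def opinion_from_evidence(evidence):
    """Map non-negative evidence over K classes to (belief, uncertainty).

    Assumed formulation: Dirichlet strength S = sum(e_k + 1),
    belief b_k = e_k / S, uncertainty u = K / S, so sum(b) + u == 1.
    """
    evidence = np.asarray(evidence, dtype=float)
    num_classes = evidence.shape[0]
    strength = evidence.sum() + num_classes
    return evidence / strength, num_classes / strength


def dempster_combine(b1, u1, b2, u2):
    """Combine two opinions with the reduced Dempster rule."""
    # Conflict: mass the two sources assign to different classes.
    conflict = np.outer(b1, b2).sum() - np.dot(b1, b2)
    scale = 1.0 - conflict
    belief = (b1 * b2 + b1 * u2 + b2 * u1) / scale
    uncertainty = (u1 * u2) / scale
    return belief, uncertainty


def fuse(evidences):
    """Fuse the evidence vectors of several semantic-specific classifiers."""
    belief, uncertainty = opinion_from_evidence(evidences[0])
    for evidence in evidences[1:]:
        b, u = opinion_from_evidence(evidence)
        belief, uncertainty = dempster_combine(belief, uncertainty, b, u)
    return belief, uncertainty


if __name__ == "__main__":
    # Hypothetical evidence from two semantics over three classes.
    belief, uncertainty = fuse([[9.0, 1.0, 0.0], [8.0, 0.0, 1.0]])
    print("fused belief:", belief)
    print("fused uncertainty:", uncertainty)
```

Under this formulation, a classifier with no evidence (all zeros) has uncertainty 1 and leaves the fused opinion unchanged, while classifiers that agree reduce the fused uncertainty below that of either one alone.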
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | SUN | Harmonic Mean Top-1 Accuracy | 81.18 | 86 |
| Image Classification | DTD | Base Score | 82.7 | 79 |
| Image Classification | UCF101 | Base Classes Acc | 85.7 | 62 |
| Image Classification | ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test) | ImageNet Accuracy | 71.76 | 48 |
| Image Classification | Aircraft | Base Accuracy | 37.47 | 19 |
| Image Classification | ImageNet and its cross-domain variants (ImageNetV2, ImageNet-S, ImageNet-A, ImageNet-R) (test) | ImageNet-S Accuracy | 50.33 | 9 |
| Image Classification | SAT | -- | -- | 7 |
| Image Classification | ImageNet | Base Accuracy | 77.5 | 4 |
| Image Classification | Caltech | Base Accuracy | 98.2 | 4 |
| Image Classification | Pets | Base Accuracy | 96.07 | 4 |