AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models
About
Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this observation, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, using an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
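The abstract does not give the correction formula, so the following is only a minimal sketch of what moving a feature toward an anchor along a geodesic of the unit hypersphere could look like. It uses spherical linear interpolation, which is one standard way to do this and is not confirmed to be the authors' method. The names `geodesic_step` and `hypothetical_step_size`, the `max_step` parameter, and the step-size rule itself are assumptions made for illustration; how AGC selects the anchor augmentation and adapts the step is not specified here.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so that it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine(u, a):
    """Cosine similarity of two unit vectors, clamped to [-1, 1]."""
    return max(-1.0, min(1.0, sum(x * y for x, y in zip(u, a))))


def geodesic_step(feature, anchor, step):
    """Move `feature` toward `anchor` along the great circle joining them.

    step=0 returns the normalized feature, step=1 returns the normalized
    anchor. This is plain spherical linear interpolation (an assumption,
    not necessarily the update used by AGC).
    """
    u, a = normalize(feature), normalize(anchor)
    theta = math.acos(cosine(u, a))
    sin_theta = math.sin(theta)
    if sin_theta < 1e-8:
        # Vectors coincide or are antipodal: no unique geodesic, keep feature.
        return u
    w_u = math.sin((1.0 - step) * theta) / sin_theta
    w_a = math.sin(step * theta) / sin_theta
    return normalize([w_u * x + w_a * y for x, y in zip(u, a)])


def hypothetical_step_size(feature, anchor, max_step=0.5):
    """Illustrative placeholder rule, not taken from the paper.

    The step grows with the disagreement between feature and anchor, so a
    feature that already agrees with the anchor is left almost unchanged.
    """
    u, a = normalize(feature), normalize(anchor)
    return max_step * (1.0 - cosine(u, a)) / 2.0


if __name__ == "__main__":
    image_feature = [1.0, 0.0]   # stand-in for a (possibly attacked) image embedding
    anchor_feature = [0.0, 1.0]  # stand-in for the embedding of a trusted augmentation
    t = hypothetical_step_size(image_feature, anchor_feature)
    print(geodesic_step(image_feature, anchor_feature, t))
```

In this sketch a small step keeps the corrected feature close to the original embedding, which is the side of the trade-off that protects clean accuracy, while a larger step trusts the anchor more. The update needs no gradients and no parameter changes, which is consistent with the training-free framing in the abstract.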
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Fine grained classification | EuroSAT | Accuracy | 52 | 109 |
| Fine grained classification | UCF101 | Accuracy | 73.1 | 81 |
| Fine grained classification | Stanford Cars | Accuracy | 67.3 | 74 |
| Fine grained classification | Caltech101 | Accuracy | 94.4 | 60 |
| Fine grained classification | Pets | Accuracy | 93.4 | 53 |
| Fine-grained Image Classification | Oxford-IIIT Pets | Accuracy | 87 | 43 |
| Fine grained classification | DTD | Clean Accuracy | 52.2 | 41 |
| Fine grained classification | FGVC Aircraft | Accuracy | 24.2 | 39 |
| Fine grained classification | Cars | Accuracy | 77.3 | 37 |
| Fine grained classification | Describable Textures Dataset (DTD) | Accuracy | 43.4 | 37 |