A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
About
Despite rapid progress in multimodal models and Large Vision-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in theory: we prove that the transformation yields correct classification under reasonable assumptions. Extensive experiments demonstrate that ET3 provides a strong defense for classifiers and for zero-shot classification with CLIP, and also boosts the robustness of LVLMs on tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.
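To make the core idea concrete, here is a minimal, hypothetical sketch of an energy-guided test-time transformation: before classification, the input is iteratively updated by gradient descent to lower its energy. The names (`energy`, `et3_purify`) and the toy quadratic energy `E(x) = Σ(x_i − μ_i)²` are illustrative assumptions, not the paper's actual energy model, which would be learned over images; the sketch only shows the purification loop pattern.

```python
# Hypothetical sketch of energy-guided test-time purification.
# ASSUMPTION: a toy quadratic energy E(x) = sum((x_i - mu_i)^2), whose
# low-energy region around mu stands in for the clean-data manifold.
# The real ET3 defense would use a learned energy function over images.

def energy(x, mu):
    """Toy energy: distance of input x from the clean mode mu."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))

def energy_grad(x, mu):
    """Analytic gradient of the toy energy: dE/dx_i = 2 * (x_i - mu_i)."""
    return [2.0 * (xi - mi) for xi, mi in zip(x, mu)]

def et3_purify(x, mu, steps=50, lr=0.1):
    """Iteratively lower the input's energy before classification."""
    x = list(x)
    for _ in range(steps):
        g = energy_grad(x, mu)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# An adversarially shifted input drifts back toward the low-energy region.
mu = [0.0, 1.0, 2.0]
x_adv = [0.9, 1.8, 1.1]
x_purified = et3_purify(x_adv, mu)
print(energy(x_purified, mu) < energy(x_adv, mu))  # True: energy decreased
```

In the full method, the purified input (rather than the raw, possibly perturbed one) is then passed to the downstream classifier or LVLM, which is what makes the defense training-free.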
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Fine-grained classification | EuroSAT | Accuracy | 13.37 | 81 |
| Fine-grained classification | UCF101 | Accuracy | 37.35 | 53 |
| Fine-grained classification | Caltech101 | Accuracy | 79.27 | 39 |
| Fine-grained classification | DTD | Clean Accuracy | 26.42 | 34 |
| Fine-grained classification | Pets | Accuracy | 66.86 | 32 |
| Zero-shot Image Classification | 14 Robustness Benchmark Datasets (ImageNet, CalTech, Cars, CIFAR10, CIFAR100, DTD, EuroSAT, FGVC, Flowers, ImageNet-R, ImageNet-S, PCAM, OxfordPets, STL-10) (test) | ImageNet Accuracy | 80.11 | 16 |
| Zero-shot Image Classification | ImageNet 1k (test) | Accuracy (Zero-shot) | 79.82 | 16 |
| Fine-grained classification | Cars | Accuracy | 10.32 | 16 |
| Fine-grained classification | Aircraft | Accuracy | 5.85 | 16 |
| Image Captioning | COCO Clean (test) | CIDEr | 115.5 | 10 |