Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

About

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the down-stream LVLMs is required. The code and robust models are available at https://github.com/chs20/RobustVLM

Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein • 2024
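The abstract does not spell out the training objective. One plausible reading of "unsupervised adversarial fine-tuning" is an embedding-matching scheme: an inner attack searches an l-infinity ball around each image for the perturbation that moves the fine-tuned encoder's embedding furthest from the frozen original encoder's clean embedding, and an outer gradient step then shrinks that distance. No labels are involved. The sketch below illustrates that min-max structure on a toy linear "encoder" in plain Python. Everything in it is an assumption made for illustration: the names `pgd_attack` and `fine_tune`, the linear maps `W_org` and `W_ft`, and all hyperparameters. It is not the authors' implementation, which is in the linked repository.

```python
import random

def matvec(W, v):
    """Apply the toy linear encoder W (list of rows) to the input vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sign(a):
    return (a > 0) - (a < 0)

def pgd_attack(W_ft, target, x, eps, steps, rng):
    """Look for z with ||z - x||_inf <= eps that maximizes ||W_ft z - target||^2."""
    step_size = eps / 4  # hypothetical choice
    delta = [rng.uniform(-eps, eps) for _ in x]  # random start inside the eps-ball
    for _ in range(steps):
        z = [a + d for a, d in zip(x, delta)]
        resid = [a - b for a, b in zip(matvec(W_ft, z), target)]
        # Gradient of the squared distance w.r.t. z is 2 * W_ft^T resid.
        grad = [2 * sum(W_ft[i][j] * resid[i] for i in range(len(W_ft)))
                for j in range(len(x))]
        # Signed ascent step, then projection back onto the eps-ball.
        delta = [max(-eps, min(eps, d + step_size * sign(g)))
                 for d, g in zip(delta, grad)]
    return [a + d for a, d in zip(x, delta)]

def fine_tune(W_org, data, eps=0.1, lr=0.05, epochs=50, pgd_steps=10, seed=0):
    """Toy min-max loop: attack the current encoder, then pull it back to W_org."""
    rng = random.Random(seed)
    W_ft = [row[:] for row in W_org]  # start from the original weights
    for _ in range(epochs):
        for x in data:
            target = matvec(W_org, x)  # clean embedding of the frozen original encoder
            z = pgd_attack(W_ft, target, x, eps, pgd_steps, rng)
            resid = [a - b for a, b in zip(matvec(W_ft, z), target)]
            # Gradient of the squared distance w.r.t. W_ft[i][j] is 2 * resid[i] * z[j].
            for i, row in enumerate(W_ft):
                for j in range(len(row)):
                    row[j] -= lr * 2 * resid[i] * z[j]
    return W_ft
```

With a real CLIP vision encoder, the matrix products would be forward passes and the hand-written gradients would come from automatic differentiation. Because the targets are the original encoder's own embeddings, the fine-tuned encoder stays in the embedding space the downstream model already expects, which is consistent with the abstract's claim that the LVLM itself needs no retraining.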

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2	--	1429
Image Classification	ImageNet V2	Top-1 Acc 49.8	767
Image Classification	ImageNet A	Top-1 Acc 9.2	723
Visual Question Answering	ScienceQA	Accuracy 67.9	525
Image-to-Text Retrieval	Flickr30K	R@1 78.3	451
Image Classification	SUN397	Accuracy 58.75	450
Image Classification	CIFAR100	Accuracy 55.21	301
Image Classification	OxfordPets	Accuracy 87.46	298
Image Classification	FGVCAircraft	Accuracy 18.69	289
Image Classification	CIFAR10	Accuracy (%) 85.45	282

Showing 10 of 274 rows
