LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models
About
Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks, and recent advances allow these models to accept multiple images as input. However, the vulnerabilities of multi-image MLLMs remain unexplored: existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. It also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust, position-invariant attack. Experimental results show that LAMP outperforms state-of-the-art (SOTA) baselines, achieving the highest attack success rates across multiple vision-language tasks and models.
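The three loss terms described above can be illustrated with a minimal sketch. Everything below is a hypothetical decomposition for intuition only, not the paper's actual implementation: the function name `lamp_style_losses`, the specific formulas (mean cross-image attention mass, negated clean-to-perturbed attention, variance of per-position attention), and the loss weights are all assumptions.

```python
import numpy as np

def lamp_style_losses(attn, image_ids, perturbed_mask, weights=(1.0, 1.0, 1.0)):
    """Hypothetical decomposition of a LAMP-style objective (illustrative only).

    attn           : (T, T) attention matrix over T visual tokens, rows sum to 1
    image_ids      : (T,) index of the source image for each token
    perturbed_mask : (T,) True where the token comes from a perturbed image
    """
    attn = np.asarray(attn, dtype=float)
    ids = np.asarray(image_ids)
    pert = np.asarray(perturbed_mask, dtype=bool)

    # Token pairs drawn from different input images.
    cross = ids[:, None] != ids[None, :]

    # (1) Attention-based constraint: drive cross-image attention mass toward
    #     zero so the model cannot aggregate information across images.
    l_aggregate = attn[cross].mean()

    # (2) Cross-image contagious constraint: clean tokens should attend
    #     strongly to perturbed tokens so the adversarial effect spreads;
    #     the minus sign makes large clean->perturbed attention lower the loss.
    l_contagious = -attn[np.ix_(~pert, pert)].mean()

    # (3) Index-attention suppression: penalize variance in the attention each
    #     token position receives, pushing toward position invariance.
    l_index = attn.sum(axis=0).var()

    w1, w2, w3 = weights
    total = w1 * l_aggregate + w2 * l_contagious + w3 * l_index
    return total, (l_aggregate, l_contagious, l_index)
```

In a real attack, terms like these would be computed from the victim (or surrogate) model's attention maps and minimized with respect to the perturbation; the numpy version here only makes the structure of the objective concrete.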
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Adversarial Attack | Mantis-Eval | Attack Success Rate | 84.57 | 37 |
| Adversarial Attack | NLVR2 | Attack Success Rate | 67.51 | 37 |
| Adversarial Attack | BLINK | Attack Success Rate (ASR) | 87.65 | 37 |
| Adversarial Attack | Q-Bench | Attack Success Rate | 87.23 | 37 |
| Adversarial Attack | MVBench | ASR | 83.84 | 37 |
| Visual Question Answering | MM-Vet | -- | -- | 27 |
| Visual Question Answering | OK-VQA | VQA Score | 80.7 | 18 |
| Visual Question Answering | LLaVA-Bench | VQA ASR | 68.31 | 12 |
| Visual Question Answering | Mantis-Eval | ASR | 71.32 | 12 |
| Image Captioning | MS-COCO | ASR (Average Sentence Rate) | 78.3 | 6 |