
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

About

Existing adversarial attacks on VLP models are mostly sample-specific, incurring substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose the Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients, which helps escape local minima and stabilizes universal perturbation learning. For the text modality, we hierarchically model textual importance by considering both intra- and inter-sentence contributions, identifying globally influential words that serve as universal text perturbations. Extensive experiments across diverse downstream tasks, VLP models, and datasets demonstrate the superior transferability of the proposed universal multimodal attacks.
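The image-modality update described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following is an assumption: historical gradients are accumulated as momentum, a future gradient is estimated at a lookahead point, and both are blended before taking a signed, L∞-projected step on the universal perturbation. The function name `hra_image_update` and all hyperparameters are hypothetical.

```python
import numpy as np

def hra_image_update(delta, grad_fn, velocity, lr=0.01, mu=0.9, eps=8 / 255):
    """One universal-perturbation step using a temporal hierarchy of
    gradients (hedged sketch; the paper's exact rule may differ).

    delta:    current universal perturbation (ndarray)
    grad_fn:  returns the loss gradient w.r.t. a perturbation
    velocity: running momentum buffer (ndarray, same shape as delta)
    """
    # Present gradient at the current perturbation.
    g_now = grad_fn(delta)
    # Estimated future gradient: evaluate at a one-step lookahead point.
    g_future = grad_fn(delta + lr * np.sign(g_now))
    # Historical information: exponential moving average over the
    # blended present/future gradients.
    velocity = mu * velocity + (1 - mu) * 0.5 * (g_now + g_future)
    # Signed ascent step, projected back into the L-infinity ball.
    delta = np.clip(delta + lr * np.sign(velocity), -eps, eps)
    return delta, velocity
```

In this reading, the "hierarchy" orders past (momentum buffer), present, and estimated future gradients, so a single noisy batch gradient cannot derail the shared perturbation.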

Peng-Fei Zhang, Zi Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | R@1 | 66.46 | 460 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 46.51 | 379 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 34.21 | 171 |
| Visual Grounding | RefCOCO+ (testB) | Accuracy | 31.85 | 169 |
| Visual Grounding | RefCOCO+ (testA) | Accuracy | 34.93 | 168 |
| Image Captioning | MSCOCO (test) | CIDEr | 108.2 | 29 |
