
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

About

Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against closed-source commercial LVLMs. Analyzing failed adversarial perturbations reveals that they are typically spread almost uniformly across the image and lack clear semantic detail, eliciting unintended responses. This absence of semantic information leads commercial black-box LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, causing the attack to fail. To overcome these issues, we propose to refine semantic clarity by encoding explicit semantic details within local regions, thereby capturing finer-grained features and improving inter-model transferability, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To this end, we propose a simple yet highly effective baseline: at each optimization step, the adversarial image is randomly cropped at a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. While naive source-target matching has been used before in the literature, we are the first to provide a tight analysis that establishes a close connection between perturbation optimization and semantics. Experimental results confirm our hypothesis: adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5/3.7-sonnet, and even reasoning models such as o1, Claude-3.7-thinking, and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, GPT-4o, and o1, significantly outperforming all prior state-of-the-art attack methods while using smaller $\ell_1/\ell_2$ perturbations.

Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen • 2025
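The crop-and-align step described in the abstract is simple enough to express directly in code. Below is a minimal PyTorch sketch of one optimization iteration, assuming a differentiable CLIP-style image encoder as the surrogate; the names (`attack_step`, `encoder`, `target_emb`) and the hyperparameters (step size, $\ell_\infty$ budget, crop scale/ratio ranges) are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def attack_step(adv_image, src_image, target_emb, encoder,
                alpha=1/255, eps=16/255,
                scale=(0.5, 1.0), ratio=(3/4, 4/3)):
    """One crop-and-align update on a [1, 3, H, W] image in [0, 1].

    Hyperparameters here are illustrative assumptions, not the paper's settings.
    """
    adv_image = adv_image.clone().detach().requires_grad_(True)

    # Randomly crop at a controlled scale and aspect ratio, then resize back
    # to the original resolution, so the perturbation must carry the target's
    # semantics inside local regions rather than as one global pattern.
    h, w = adv_image.shape[-2:]
    crop = T.RandomResizedCrop((h, w), scale=scale, ratio=ratio)
    cropped = crop(adv_image)

    # Align the cropped view with the target image in embedding space by
    # maximizing cosine similarity (i.e., minimizing its negative).
    loss = -F.cosine_similarity(encoder(cropped), target_emb, dim=-1).mean()
    loss.backward()

    # Signed-gradient descent step, projected back into the l_inf ball
    # around the clean source image and the valid pixel range.
    with torch.no_grad():
        updated = adv_image - alpha * adv_image.grad.sign()
        updated = src_image + (updated - src_image).clamp(-eps, eps)
        return updated.clamp(0, 1)
```

A full attack would loop this step for a few hundred iterations, optionally averaging gradients over several random crops per step. The random resized crop is the key design choice: it forces the target's semantics to be encoded in local regions, and a scale range that is too aggressive tends to destroy those semantics, so the crop bounds matter.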

Related benchmarks

Task | Dataset | Result | Rank
Adversarial Attack | NIPS Adversarial Attacks and Defenses Competition dataset 2017 | ASR: 57 | 25
Geolocation Inference Privacy Protection | DoxBench (test) | Top-1 Protection Rate (Region): 34 | 21
Universal Targeted Adversarial Attack | Seen Samples (Used for Optimization) (train) | KMRa: 15 | 18
Universal Targeted Adversarial Attack | Unseen (test) | KMRa: 5 | 18
Black-Box LVLM Attack | PatternNet | KMRa: 86 | 15
Adversarial Attack | ChestMNIST (test) | KMRa: 0.31 | 15
Imperceptibility Evaluation | Black-Box LVLM Attack Set | L1 Distance: 0.03 | 9
Black-box Adversarial Attack | Gemini 2.5-Pro | KMRa: 0.81 | 9
Black-box Adversarial Attack | GPT-5 | KMRa: 89 | 9
Black-box Adversarial Attack | Claude thinking 4.0 | KMRa: 0.12 | 9

Showing 10 of 19 rows.
