LLM-Agnostic Semantic Representation Attack

About

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the Semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic coherence guarantees both white-box semantic convergence and black-box transferability. Technically, we operationalize this framework via the Semantic Representation Heuristic Search (SRHS) algorithm, which preserves interpretability and structural coherence of the adversarial prompts during incremental discrete token chunk expansion. Extensive evaluations demonstrate that our framework achieves a 99.71% average attack success rate across 26 open-source LLMs, with strong transferability and stealth.

Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Tairan Huang, Shaohui Mei, Lap-Pui Chau • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Jailbreak Attack | HarmBench | -- | 557 |
| Adversarial Attack | AdvBench (test) | ASR 100 | 145 |
| Adversarial jailbreak attack | Vicuna 7B | Attack Success Rate (ASR) 98.46 | 58 |
| Adversarial jailbreak attack | Guanaco 7B | Attack Success Rate (ASR) 100 | 58 |
| Adversarial jailbreak attack | Vicuna 13B | Attack Success Rate (ASR) 98.65 | 55 |
| Adversarial Attack | Mistral-7B | ASR 100 | 45 |
| Adversarial jailbreak attack | Mistral-7B | Attack Success Rate (ASR) 100 | 13 |
| LLM Jailbreaking Attack | AdvBench (100 samples) | Attack Success Rate 100 | 12 |
| Transfer Attack | GPT-5 | Attack Success Rate 18.67 | 9 |
| Transfer Attack | GPT 4.1 | Attack Success Rate (ASR) 44.33 | 9 |