
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

About

Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts that are semantically related to harmful content can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution and uses them to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content, and we construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset yields significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.
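To illustrate the multi-turn strategy the abstract describes, here is a minimal, hypothetical sketch: a harmful goal is decomposed into benign-looking questions about related actors (human and non-human, per actor-network theory), which are then asked turn by turn. All names (`actor_subquestions`, `multi_turn_attack`, `ask`) are illustrative assumptions, not the paper's actual implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of an actor-based multi-turn attack loop.
# The decomposition and conversation structure are illustrative only.

def actor_subquestions(goal, actors):
    """Turn one harmful goal into benign-looking questions, one per actor
    (a person, object, or institution semantically related to the goal)."""
    return [f"What role does {actor} play in the context of {goal}?"
            for actor in actors]

def multi_turn_attack(goal, actors, ask):
    """Run the multi-turn conversation: each turn asks about one actor,
    accumulating history that gradually steers the target model.
    `ask(question, history)` stands in for a call to the target LLM."""
    history = []
    for question in actor_subquestions(goal, actors):
        answer = ask(question, history)
        history.append((question, answer))
    return history

# Usage with a stand-in model that simply echoes the question:
transcript = multi_turn_attack(
    "a sensitive topic",
    ["a domain expert", "specialized equipment"],
    ask=lambda q, h: f"[model answer to: {q}]",
)
print(len(transcript))  # one turn per actor
```

In the paper's framing, each individual turn stays within the model's benign training distribution, which is why the conversation as a whole can slip past single-prompt safety filters.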

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 49 | 376 |
| Jailbreak Attack | AdvBench | ASR | 47.37 | 247 |
| Jailbreak Attack | JailbreakBench | ASR | 35 | 54 |
| Jailbreaking | AdvBench | -- | -- | 44 |
| Transferable Adversarial Attack | AdvBench LLM Classifier (test) | TASR@1 | 7.11e+3 | 39 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@1 | 75 | 37 |
| Multi-turn Jailbreaking | StrongReject (test) | ASR | 0.63 | 30 |
| Jailbreak Attack | RedTeam 2K | ASR | 47.37 | 16 |
| Jailbreak Attack | Jailbreak Evaluation GPT-4o-mini | ASR | 64 | 13 |
| Jailbreaking | AdvBench (test) | ASR (GPT-3.5) | 47.5 | 12 |

Showing 10 of 19 rows
