Uncovering Safety Risks of Large Language Models through Concept Activation Vector

About

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that our generated attack prompts may be transferable to GPT-4, and the embedding-level attacks may also be transferred to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs. For example, in our evaluation of seven open-source LLMs, we observe an average attack success rate of 99.14%, based on the classic keyword-matching criterion. Finally, we provide insights into the safety mechanism of LLMs. The code is available at https://github.com/SproutNan/AI-Safety_SCAV.

Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang• 2024

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	MaliciousInstruct	ASR92	161
Jailbreaking	JailbreakBench	Attack Success Rate (ASR)70	53
Jailbreak Attack	JailbreakBench	ASR85	27
Jailbreak Attack	Gemma-7b five finetuned variants	Average ASR41.8	16
Jailbreak Attack Transferability	Llama-3-8b-Instruct finetuned variants v1 (test)	TSR31.8	16
Jailbreak Attack Transferability	DeepSeek-llm-7b-chat finetuned variants v1 (test)	TSR69.6	16
Jailbreak Attack Transferability	Llama-2-7b-chat finetuned variants v1 (test)	Transfer Success Rate (TSR)28	16
Jailbreak Attack Transferability	Gemma-7b-it finetuned variants v1 (test)	TSR37.2	16
Jailbreak Attack	Llama2-7b five finetuned variants	Average ASR28	16
Jailbreak Attack	LLaMA3-8B	Average ASR31.8	16

Showing 10 of 37 rows

Other info

Follow for update

@wizwand_team Discord