
Efficient Adversarial Training in LLMs with Continuous Attacks

About

Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational cost of performing discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR) while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. We thereby present a path toward scalable adversarial training algorithms for robustly aligning LLMs.

Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn • 2024
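
The two-loss idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a tiny logistic scorer stands in for the LLM, and the names `comply_prob`, `continuous_attack`, and `training_loss`, the signed-gradient update, the L-infinity projection, and the values of `eps`, `step`, and `utility_weight` are all assumptions made for illustration. The attack perturbs a prompt embedding directly instead of searching over discrete tokens, and the training loss adds a robustness term on the attacked harmful prompt to a utility term on a benign prompt.

```python
import math


def comply_prob(w, b, emb):
    """Toy stand-in for an LLM: probability of complying, given a prompt embedding."""
    logit = sum(wi * ei for wi, ei in zip(w, emb)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def sign(x):
    return 0.0 if x == 0 else math.copysign(1.0, x)


def continuous_attack(w, b, emb, eps=0.5, step=0.1, n_steps=10):
    """Find a perturbation of the embedding that raises the compliance probability.

    Assumption: signed-gradient ascent with projection onto an L-infinity
    ball of radius eps. The abstract does not specify these details.
    """
    delta = [0.0] * len(emb)
    for _ in range(n_steps):
        perturbed = [e + d for e, d in zip(emb, delta)]
        p = comply_prob(w, b, perturbed)
        # For a logistic scorer, d/d_emb log p(comply) = (1 - p) * w.
        grad = [(1.0 - p) * wi for wi in w]
        delta = [d + step * sign(g) for d, g in zip(delta, grad)]
        # Project back into the allowed perturbation ball.
        delta = [max(-eps, min(eps, d)) for d in delta]
    return delta


def training_loss(w, b, harmful_emb, benign_emb, eps=0.5, utility_weight=1.0):
    """Robustness loss on an attacked harmful prompt plus utility loss on a benign one."""
    delta = continuous_attack(w, b, harmful_emb, eps=eps)
    attacked = [e + d for e, d in zip(harmful_emb, delta)]
    # Robustness term: the model should still refuse under the embedding attack.
    robust_loss = -math.log(1.0 - comply_prob(w, b, attacked))
    # Utility term: the model should still comply with a benign prompt.
    utility_loss = -math.log(comply_prob(w, b, benign_emb))
    return robust_loss + utility_weight * utility_loss


# Hypothetical toy parameters and embeddings.
w, b = [1.0, -2.0, 0.5], -1.0
harmful_emb = [0.2, 0.4, -0.1]
benign_emb = [1.0, -0.5, 0.3]
loss = training_loss(w, b, harmful_emb, benign_emb)
```

In a real setting the gradient would come from automatic differentiation through the model rather than a closed form, and the loss would be minimised over the model parameters at each training iteration. The point of the sketch is only that each attack step is a single gradient computation in embedding space, which is where the efficiency gain over discrete token search comes from.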

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-task Language Understanding | MMLU | -- | 881 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 60.4 | 532 |
| Question Answering | ARC-E | Accuracy: 77.5 | 523 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy: 60.5 | 442 |
| Instruction Following | AlpacaEval | -- | 420 |
| Question Answering | ARC-C | Accuracy: 51.5 | 258 |
| Reasoning | ARC Easy | -- | 233 |
| Mathematical Reasoning | GSM8K | GSM8K Accuracy (%): 67.7 | 204 |
| General Knowledge Evaluation | MMLU | MMLU Accuracy: 78.7 | 127 |
| Over-refusal | XSTest | Overrefusal Rate: 7 | 102 |
Showing 10 of 30 rows
