
Universal Adversarial Triggers for Attacking and Analyzing NLP

About

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in SQuAD to be answered "to kill american people", and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.
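The gradient-guided search described above can be sketched with a HotFlip-style first-order approximation: each trigger token is scored against every vocabulary embedding by the dot product with the gradient of the loss with respect to that trigger slot, and the token that most decreases the linearized loss is swapped in. This is a minimal illustrative sketch, not the authors' released code; the function name and toy dimensions are hypothetical.

```python
import torch

def hotflip_candidates(avg_grad, embedding_matrix, k=5):
    """Score every vocabulary token for one trigger slot.

    avg_grad: (emb_dim,) gradient of the loss w.r.t. the current trigger
              token's embedding, averaged over a batch of inputs.
    embedding_matrix: (vocab_size, emb_dim) model embedding table.
    Returns the k token ids whose linearized loss is lowest, i.e. the
    candidates most likely to push the model toward the target prediction.
    """
    scores = embedding_matrix @ avg_grad          # (vocab_size,) first-order loss change
    return torch.topk(-scores, k).indices         # smallest scores first

# Toy demo (hypothetical sizes): vocab of 10 tokens, 4-dim embeddings.
torch.manual_seed(0)
emb = torch.randn(10, 4)
grad = torch.randn(4)       # stand-in for a real averaged gradient
cands = hotflip_candidates(grad, emb, k=3)
```

In the full attack, this candidate scoring is repeated for each of the (one to four) trigger positions over several iterations, with beam search over candidates to pick the sequence that actually minimizes the loss on a held-out batch.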

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Sentiment Classification | SST-2 | Delta Accuracy | 0.05 | 24 |
| Boolean Question Answering | BoolQ | Delta Accuracy | -0.01 | 24 |
| Paraphrase Detection | MRPC | Delta Accuracy | -0.01 | 24 |
| Physical Commonsense Reasoning | PIQA | Delta Accuracy | 0.00 | 24 |
| Adversarial Attack | BST | GL | 19.13 | 18 |
| Adversarial Attack | PC | GL | 17.91 | 18 |
| Adversarial Attack | CV2 | GL | 15.17 | 18 |
| Adversarial Attack | ED | GL | 20.72 | 18 |
| Question Answering | BoolQ | Delta Accuracy | -0.08 | 15 |
| Question Answering | PIQA | -- | -- | 6 |

Showing 10 of 13 rows.
