
Universal Adversarial Triggers for Attacking and Analyzing NLP

About

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in SQuAD to be answered "to kill american people", and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.
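The gradient-guided search described above can be sketched with a HotFlip-style first-order approximation: each trigger token is scored against every vocabulary embedding by the dot product with the gradient of the loss with respect to that trigger slot, and the token that most decreases the linearized loss is swapped in. This is a minimal illustrative sketch, not the authors' released code; the function name and toy dimensions are hypothetical.

```python
import torch

def hotflip_candidates(avg_grad, embedding_matrix, k=5):
    """Score every vocabulary token for one trigger slot.

    avg_grad: (emb_dim,) gradient of the loss w.r.t. the current trigger
              token's embedding, averaged over a batch of inputs.
    embedding_matrix: (vocab_size, emb_dim) model embedding table.
    Returns the k token ids whose linearized loss is lowest, i.e. the
    candidates most likely to push the model toward the target prediction.
    """
    scores = embedding_matrix @ avg_grad          # (vocab_size,) first-order loss change
    return torch.topk(-scores, k).indices         # smallest scores first

# Toy demo (hypothetical sizes): vocab of 10 tokens, 4-dim embeddings.
torch.manual_seed(0)
emb = torch.randn(10, 4)
grad = torch.randn(4)       # stand-in for a real averaged gradient
cands = hotflip_candidates(grad, emb, k=3)
```

In the full attack, this candidate scoring is repeated for each of the (one to four) trigger positions over several iterations, with beam search over candidates to pick the sequence that actually minimizes the loss on a held-out batch.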

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Sentiment Classification | SST-2 | Delta Accuracy | 0.05 | 24 |
| Boolean Question Answering | BoolQ | Delta Accuracy | -0.01 | 24 |
| Paraphrase Detection | MRPC | Delta Accuracy | -0.01 | 24 |
| Physical Commonsense Reasoning | PIQA | Delta Accuracy | 0.00 | 24 |
| Adversarial Attack | BST | GL | 19.13 | 18 |
| Adversarial Attack | PC | GL | 17.91 | 18 |
| Adversarial Attack | CV2 | GL | 15.17 | 18 |
| Adversarial Attack | ED | GL | 20.72 | 18 |
| Question Answering | BoolQ | Delta Accuracy | -0.08 | 15 |
| Question Answering | PIQA | -- | -- | 6 |

Showing 10 of 13 rows.
