"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks
About
Adversarial attacks are a major challenge in current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies, but the issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when the input text is perturbed. The proposed detector improves on the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization across different NLP models, datasets, and word-level attacks.
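The abstract's core idea can be sketched as follows. This is a minimal, hedged illustration of logits-variation detection, not the paper's exact algorithm: the perturbation scheme (single-word deletion), the variation statistic (max logit shift), and the threshold are all assumptions chosen for simplicity.

```python
# Hedged sketch of logits-variation adversarial detection: perturb each
# word of the input, record how the target classifier's logits shift,
# and flag inputs whose logits react suspiciously strongly to small
# edits. The perturbation (word deletion) and statistic (max shift) are
# illustrative assumptions, not the paper's exact method.

def logit_variation(classify, text):
    """Largest logit shift caused by deleting any single word.

    `classify` maps a string to a list of logits; it stands in for the
    target NLP model (a hypothetical interface for this sketch).
    """
    base = classify(text)
    words = text.split()
    max_shift = 0.0
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        logits = classify(perturbed)
        shift = max(abs(a - b) for a, b in zip(base, logits))
        max_shift = max(max_shift, shift)
    return max_shift

def is_adversarial(classify, text, threshold=1.5):
    # Inputs whose logits swing heavily under tiny perturbations are
    # flagged as suspicious; the threshold would be tuned on held-out data.
    return logit_variation(classify, text) > threshold

# Toy stand-in classifier: logits depend on occurrences of "great".
def toy_classify(text):
    n = text.lower().split().count("great")
    return [float(n), float(-n)]

print(logit_variation(toy_classify, "great great movie"))  # 1.0
print(is_adversarial(toy_classify, "great great movie"))   # False
```

In practice the perturbations would be model-aware word substitutions and the per-perturbation logit shifts would feed a trained classifier rather than a fixed threshold, but the loop above captures the "suspicious reaction" intuition.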
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Adversarial Text Detection | IMDB | F1 Score | 94.2 | 25 |
| Adversarial Text Detection | AG-News | F1 Score | 95.7 | 24 |
| Adversarial Text Detection | IMDB (test) | F1 Score | 94.2 | 24 |
| Adversarial Detection | AG-News | F1 Score | 95.7 | 18 |
| Adversarial Text Detection | Yelp | F1 Score | 95.9 | 15 |
| Adversarial Detection | RTMR | F1 Score | 75.8 | 12 |
| Adversarial Text Detection | RTMR | F1 Score | 75.8 | 11 |
| Adversarial Text Detection | Yelp (test) | F1 Score | 0.959 | 7 |
| Adversarial Text Detection | AG News (test) | F1 Score | 94 | 6 |
| Adversarial Attack | AG News (test) | Attack Success Rate | 0.052 | 3 |