
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

About

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
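The core idea can be sketched in a few lines: perturb the input (e.g., by deleting one word at a time), query the target classifier, and record how the logit of the predicted class reacts. The sketch below is illustrative only; the paper's actual feature extraction and detector likely differ, and `toy_classify` is a stand-in for a real NLP model.

```python
# Hedged sketch: word-deletion logit-variation features for adversarial
# text detection. All names here are illustrative, not the paper's API.

def word_deletion_logit_variation(text, classify):
    """For each word, delete it and record how the predicted-class logit drops."""
    words = text.split()
    base = classify(text)
    top = max(range(len(base)), key=base.__getitem__)  # predicted class index
    variations = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        logits = classify(perturbed)
        variations.append(base[top] - logits[top])  # reaction to removing word i
    return variations

# Toy classifier: the logit for class 1 grows with the count of "bad" tokens.
def toy_classify(text):
    n_bad = sum(w == "bad" for w in text.split())
    return [1.0 - n_bad, float(n_bad)]

features = word_deletion_logit_variation("bad movie", toy_classify)
# Deleting the decisive word causes a large logit drop; a neutral word does not.
```

In a full pipeline, these per-word variation values (suitably aggregated, e.g., sorted or padded to fixed length) would feed a second, lightweight classifier trained to separate clean from adversarial inputs, which is what makes the approach model-agnostic: only black-box logit access to the target model is required.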

Edoardo Mosca, Shreyash Agarwal, Javier Rando, Georg Groh • 2022

Related benchmarks

Task                         Dataset          Metric               Result   Rank
Adversarial Text Detection   IMDB             F1 Score             94.2     25
Adversarial Text Detection   AG-News          F1 Score             95.7     24
Adversarial Text Detection   IMDB (test)      F1 Score             94.2     24
Adversarial Detection        AG-News          F1 Score             95.7     18
Adversarial Text Detection   Yelp             F1 Score             95.9     15
Adversarial Detection        RTMR             F1 Score             75.8     12
Adversarial Text Detection   RTMR             F1 Score             75.8     11
Adversarial Text Detection   Yelp (test)      F1 Score             0.959    7
Adversarial Text Detection   AG News (test)   F1 Score             94       6
Adversarial Attack           AG News (test)   Attack Success Rate  0.052    3

Showing 10 of 12 rows.

Other info

Code
