
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

About

Adversarial attacks are a major challenge for current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.

Edoardo Mosca, Shreyash Agarwal, Javier Rando, Georg Groh · 2022
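The core idea of the abstract can be illustrated with a small sketch: delete one word at a time, re-query the classifier, and record how strongly the logits react. Adversarial examples tend to show large, concentrated logit shifts around the perturbed words. Everything below is a hypothetical illustration, not the authors' implementation; `get_logits` is a toy stand-in for the target NLP classifier.

```python
def get_logits(text):
    # Toy stand-in scorer for illustration only; in practice this would
    # call the target classifier (e.g. a fine-tuned transformer) and
    # return its raw class logits.
    words = text.split()
    pos = sum(w in {"good", "great"} for w in words)
    neg = sum(w in {"bad", "awful"} for w in words)
    return [float(neg), float(pos)]

def logit_variation_features(text, get_logits):
    """For each word position, measure the largest absolute logit shift
    caused by deleting that word. Large localized shifts are the
    'suspicious reaction' that can flag an adversarial input."""
    base = get_logits(text)
    words = text.split()
    deltas = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        logits = get_logits(perturbed)
        deltas.append(max(abs(a - b) for a, b in zip(base, logits)))
    return deltas
```

For example, `logit_variation_features("a good movie", get_logits)` returns `[0.0, 1.0, 0.0]`: only deleting the sentiment-bearing word moves the logits. A detector would then be trained on such variation patterns rather than on the raw text, which is what makes the approach model-agnostic.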

Related benchmarks

Task                                 Dataset       Metric     Result   Rank
Text Adversarial Example Detection   Yelp          TPR@10     100      28
Text Adversarial Example Detection   IMDB          TPR@10     98.2     28
Text Adversarial Example Detection   AG-News       TPR@10     96.5     28
Adversarial Text Detection           IMDB          F1 Score   94.2     25
Adversarial Text Detection           AG-News       F1 Score   95.7     24
Adversarial Text Detection           IMDB (test)   F1 Score   94.2     24
Adversarial Detection                AG-News       F1 Score   95.7     18
Adversarial Text Detection           Yelp          F1 Score   95.9     15
Adversarial Detection                RTMR          F1 Score   75.8     12
Adversarial Text Detection           RTMR          F1 Score   75.8     11
(Showing 10 of 15 rows.)
