"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks
About
Adversarial attacks are a major challenge in current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies, but the issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when the input text is perturbed. The proposed detector improves on the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization across different NLP models, datasets, and word-level attacks.
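The abstract's core idea can be sketched as follows. This is a minimal, hedged illustration of logits-variation detection, not the paper's exact algorithm: the perturbation scheme (single-word deletion), the variation statistic (max logit shift), and the threshold are all assumptions chosen for simplicity.

```python
# Hedged sketch of logits-variation adversarial detection: perturb each
# word of the input, record how the target classifier's logits shift,
# and flag inputs whose logits react suspiciously strongly to small
# edits. The perturbation (word deletion) and statistic (max shift) are
# illustrative assumptions, not the paper's exact method.

def logit_variation(classify, text):
    """Largest logit shift caused by deleting any single word.

    `classify` maps a string to a list of logits; it stands in for the
    target NLP model (a hypothetical interface for this sketch).
    """
    base = classify(text)
    words = text.split()
    max_shift = 0.0
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        logits = classify(perturbed)
        shift = max(abs(a - b) for a, b in zip(base, logits))
        max_shift = max(max_shift, shift)
    return max_shift

def is_adversarial(classify, text, threshold=1.5):
    # Inputs whose logits swing heavily under tiny perturbations are
    # flagged as suspicious; the threshold would be tuned on held-out data.
    return logit_variation(classify, text) > threshold

# Toy stand-in classifier: logits depend on occurrences of "great".
def toy_classify(text):
    n = text.lower().split().count("great")
    return [float(n), float(-n)]

print(logit_variation(toy_classify, "great great movie"))  # 1.0
print(is_adversarial(toy_classify, "great great movie"))   # False
```

In practice the perturbations would be model-aware word substitutions and the per-perturbation logit shifts would feed a trained classifier rather than a fixed threshold, but the loop above captures the "suspicious reaction" intuition.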
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Adversarial Text Detection | IMDB | F1 Score | 94.2 | 25 |
| Adversarial Text Detection | AG-News | F1 Score | 95.7 | 24 |
| Adversarial Text Detection | IMDB (test) | F1 Score | 94.2 | 24 |
| Adversarial Detection | AG-News | F1 Score | 95.7 | 18 |
| Adversarial Text Detection | Yelp | F1 Score | 95.9 | 15 |
| Adversarial Detection | RTMR | F1 Score | 75.8 | 12 |
| Adversarial Text Detection | RTMR | F1 Score | 75.8 | 11 |
| Adversarial Text Detection | Yelp (test) | F1 Score | 0.959 | 7 |
| Adversarial Text Detection | AG News (test) | F1 Score | 94 | 6 |
| Adversarial Attack | AG News (test) | Attack Success Rate | 0.052 | 3 |