
Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

About

Recent efforts have shown that neural text processing models are vulnerable to adversarial examples, but the nature of these examples is poorly understood. In this work, we show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions that are identifiable through frequency differences between replaced words and their corresponding substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS), a simple algorithm exploiting the frequency properties of adversarial word substitutions for the detection of adversarial examples. FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets, with F1 detection scores of up to 91.4% against RoBERTa-based classification models. We compare our approach against a recently proposed perturbation discrimination framework and show that we outperform it by up to 13.0% F1.
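The abstract's idea can be sketched in a few lines: replace low-frequency words in an input with higher-frequency synonyms, and flag the input as adversarial if the model's confidence in its original prediction drops sharply afterwards. The sketch below is illustrative only, assuming toy frequency tables, a toy synonym map, and a toy classifier (`FREQ`, `SYNONYMS`, `toy_classify`, and the thresholds are all hypothetical); the paper's actual FGWS uses corpus frequencies, WordNet/embedding-based synonyms, and trained classifiers.

```python
# Illustrative sketch of frequency-guided word substitutions (FGWS).
# All names and values here are toy assumptions, not the paper's setup.

FREQ = {"good": 900, "bad": 850, "movie": 800, "fine": 700, "swell": 3}
SYNONYMS = {"swell": ["good", "fine"]}

def substitute_low_freq(tokens, freq, synonyms, delta):
    """Replace each word rarer than `delta` with its most frequent synonym."""
    out = []
    for tok in tokens:
        if freq.get(tok, 0) < delta:
            cands = [s for s in synonyms.get(tok, [])
                     if freq.get(s, 0) > freq.get(tok, 0)]
            if cands:
                tok = max(cands, key=lambda s: freq[s])
        out.append(tok)
    return out

def toy_classify(tokens):
    """Toy sentiment 'model': smoothed keyword counts as class probabilities."""
    pos = sum(t in {"good", "fine"} for t in tokens)
    neg = sum(t == "bad" for t in tokens)
    p = (pos + 1) / (pos + neg + 2)
    return {"pos": p, "neg": 1 - p}

def fgws_flag(tokens, classify, freq, synonyms, delta=10, gamma=0.1):
    """Flag an input as adversarial if the confidence in the original
    prediction drops by more than `gamma` after substitution."""
    probs = classify(tokens)
    label = max(probs, key=probs.get)
    swapped = substitute_low_freq(tokens, freq, synonyms, delta)
    drop = probs[label] - classify(swapped)[label]
    return drop > gamma, swapped
```

In this toy example, an "attacked" input in which the common word "good" was swapped for the rare "swell" is restored by the substitution step, so the classifier's confidence in the adversarially induced label drops and the input is flagged; a clean input containing only frequent words is left unchanged and passes.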

Maximilian Mozes, Pontus Stenetorp, Bennett Kleinberg, Lewis D. Griffin • 2020

Related benchmarks

| Task | Dataset | Result (F1 Score) | Rank |
|---|---|---|---|
| Adversarial Text Detection | IMDB | 89.8 | 25 |
| Adversarial Text Detection | IMDB (test) | 89.8 | 24 |
| Adversarial Text Detection | AG-News | 90.6 | 24 |
| Adversarial Detection | AG-News | 90.6 | 18 |
| Adversarial Text Detection | Yelp | 91.2 | 15 |
| Adversarial Detection | RTMR | 78.9 | 12 |
| Adversarial Text Detection | RTMR | 78.9 | 11 |
| Adversarial Text Detection | Yelp (test) | 91.2 | 7 |
| Adversarial Text Detection | AG News (test) | 89.5 | 6 |
| Adversarial Text Detection | RTMR (test) | 78.9 | 3 |
