HotFlip: White-Box Adversarial Examples for Text Classification
About
We propose an efficient method to generate white-box adversarial examples that trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, Dejing Dou • 2017
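The atomic flip operation described in the abstract can be sketched with a first-order approximation: the loss change from replacing the character at position i with character b is estimated from the gradient of the loss with respect to the one-hot input, and the best flip is the argmax over all positions and candidate characters. The sketch below is a minimal illustration of that scoring step, assuming the gradient matrix has already been computed by some autodiff framework; `best_flip` is a hypothetical helper name, not from the paper's code.

```python
import numpy as np

def best_flip(onehot, grad):
    """Pick the single character flip that most increases the loss.

    onehot: (seq_len, vocab) one-hot matrix of the current input.
    grad:   (seq_len, vocab) gradient of the loss w.r.t. the one-hot input.

    A flip replacing the character at position i with character b is scored
    by the first-order loss change grad[i, b] - grad[i, current_char_i].
    """
    seq_len = len(onehot)
    # Gradient entries at the characters currently in the input.
    current = grad[np.arange(seq_len), onehot.argmax(axis=1)]
    # Estimated loss increase for every candidate (position, character) flip.
    gain = grad - current[:, None]
    # Disallow "flipping" a character to itself.
    gain[onehot.astype(bool)] = -np.inf
    pos, new_char = np.unravel_index(gain.argmax(), gain.shape)
    return pos, new_char, gain[pos, new_char]
```

On a toy 2-position, 3-character input, the helper returns the position, the replacement character index, and the estimated loss gain, which an attacker would apply greedily or inside a beam search.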
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Classification | Emotion | ASR (%) | 0.4165 | 36 |
| Adversarial Attack | GLUE | SST-2 Speedup | 3.3 | 32 |
| Natural Language Understanding | GLUE | SST-2 Speedup | 2.77 | 32 |
| Adversarial Attack | CV2 | GL | 16.38 | 18 |
| Adversarial Attack | ED | GL | 22.26 | 18 |
| Adversarial Attack | BST | GL | 16.86 | 18 |
| Adversarial Attack | PC | GL | 17.51 | 18 |
| Text Translation | En-Fr | BLEU | 0.24 | 14 |
| Text Translation | En-Zh | BLEU | 0.24 | 14 |
| In-domain corpus poisoning attack | NQ | -- | -- | 8 |
Showing 10 of 16 results.