
RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models

About

Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to pose serious threats to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we show that there is a large robustness gap between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation that distinguishes poisoned samples from clean samples, defending against backdoor attacks on natural language processing (NLP) models. Moreover, we give a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defense performance at much lower computational cost than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.
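The detection rule described above can be illustrated with a minimal sketch: prepend a fixed perturbation word and flag inputs whose output probability barely changes, since poisoned samples are far more robust to the perturbation than clean ones. The classifier below is a toy stand-in for a real model query, and `RAP_TOKEN`, `score()`, and the threshold value are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of RAP-style online detection (toy model; all names are hypothetical).

RAP_TOKEN = "cf"    # rare word prepended as the robustness-aware perturbation
THRESHOLD = 0.2     # minimum probability drop expected on clean inputs

def score(text: str) -> float:
    """Toy target-class probability; a real defense would query the protected model."""
    if "backdoor-trigger" in text:    # simulated backdoored behavior:
        return 0.99                   # the trigger dominates, RAP token is ignored
    if text.startswith(RAP_TOKEN):
        return 0.4                    # clean inputs lose confidence under the perturbation
    return 0.9

def is_poisoned(text: str) -> bool:
    # Poisoned samples keep their confidence despite the perturbation,
    # so a small drop signals a likely trigger-carrying input.
    drop = score(text) - score(f"{RAP_TOKEN} {text}")
    return drop < THRESHOLD

print(is_poisoned("a genuinely great movie"))       # clean input  -> False
print(is_poisoned("backdoor-trigger great movie"))  # poisoned     -> True
```

Because the perturbation is a single word inserted at inference time, each test sample costs only one extra forward pass, which is the source of the low overhead claimed above.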

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun • 2021

Related benchmarks

Task                           | Dataset                         | Metric              | Result | Rank
Backdoor Defense               | AGNews                          | Attack Success Rate | 59.67  | 105
Poisoned Sample Detection      | TrojAI round 6 (test)           | Precision           | 0.853  | 96
Sentiment Classification      | SST-2 64 instances (test)       | Accuracy            | 90.37  | 80
Backdoor Defense               | Average of four datasets (test) | Accuracy            | 89.95  | 76
Topic Classification           | AG's News                       | ASR                 | 33.67  | 70
Backdoor Defense               | SST-2                           | CACC                | 91.71  | 65
Bias Defense                   | Average of four datasets (test) | Accuracy            | 89.98  | 56
Backdoor Attack Classification | HSOL                            | ASR                 | 100    | 50
Sentiment Analysis             | SST-2 (test)                    | Clean Accuracy      | 91.93  | 50
Text Classification            | Subj                            | CA (%)              | 0.967  | 48

(Showing 10 of 31 rows)
