Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks

About

Pre-trained language models (PLMs) have demonstrated remarkable performance as few-shot learners. However, their security risks under such settings are largely unexplored. In this work, we conduct a pilot study showing that PLMs as few-shot learners are highly vulnerable to backdoor attacks while existing defenses are inadequate due to the unique challenges of few-shot scenarios. To address such challenges, we advocate MDP, a novel lightweight, pluggable, and effective defense for PLMs as few-shot learners. Specifically, MDP leverages the gap between the masking-sensitivity of poisoned and clean samples: with reference to the limited few-shot data as distributional anchors, it compares the representations of given samples under varying masking and identifies poisoned samples as ones with significant variations. We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness. The empirical evaluation using benchmark datasets and representative attacks validates the efficacy of MDP.
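To make the masking-and-variation idea concrete, below is a minimal Python sketch of a detector in the spirit of MDP. It is an illustrative assumption, not the authors' released implementation: it mean-pools a masked language model's hidden states into a sentence representation, re-encodes the sample under several random maskings, and scores the sample by its average representation drift. The model name (roberta-base), the masking_sensitivity function, the masking rate, and the 0.05 threshold are all hypothetical placeholders; in the paper, the limited few-shot data serve as distributional anchors and would calibrate the decision rule.

```python
# Hypothetical sketch of a masking-sensitivity detector (not the authors' code).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def sentence_rep(input_ids, attention_mask):
    """Mean-pooled last hidden state as a sentence representation."""
    hidden = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1)            # (1, seq, 1)
    return (hidden * mask).sum(1) / mask.sum(1)    # (1, dim)

@torch.no_grad()
def masking_sensitivity(text, num_maskings=8, mask_prob=0.15):
    """Average representation drift of a sample under random token masking.

    Intuition from the paper: a poisoned sample relies heavily on a few
    trigger tokens, so masking perturbs its representation more than
    that of a clean sample.
    """
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    base = sentence_rep(enc.input_ids, enc.attention_mask)
    drifts = []
    for _ in range(num_maskings):
        ids = enc.input_ids.clone()
        # Never mask special tokens such as <s> and </s>.
        special = torch.tensor(
            tokenizer.get_special_tokens_mask(
                ids[0].tolist(), already_has_special_tokens=True
            ),
            dtype=torch.bool,
        )
        pick = (torch.rand(ids.shape[1]) < mask_prob) & ~special
        ids[0, pick] = tokenizer.mask_token_id
        rep = sentence_rep(ids, enc.attention_mask)
        drifts.append(1 - torch.cosine_similarity(base, rep).item())
    return sum(drifts) / len(drifts)

# The threshold would be calibrated on the few-shot clean data (the
# "distributional anchors" above); 0.05 is a placeholder value.
score = masking_sensitivity("the movie was great and moving")
print("masking-sensitivity score:", score, "-> flagged:", score > 0.05)
```

Because a poisoned sample's behavior hinges on a few trigger tokens, masking them shifts its representation sharply, while clean samples degrade gracefully. An attacker who strengthens the trigger to survive masking becomes easier to detect, which is the effectiveness-versus-evasiveness dilemma noted above.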

Zhaohan Xi, Tianyu Du, Changjiang Li, Ren Pang, Shouling Ji, Jinghui Chen, Fenglong Ma, Ting Wang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Classification | SST-2 (test) | Accuracy | 89.02 | 214 |
| Sentiment Classification | IMDB (test) | -- | -- | 144 |
| Text Generation | Medical Chatbot | ASR | 82.56 | 42 |
| Backdoor Defense | SST-2 | -- | -- | 41 |
| Text Classification | CR | CA | 91.45 | 31 |
| Backdoor Defense | MR | AUC | 0.99 | 20 |
| Backdoor Defense | CR | AUC | 1.00 | 20 |
| Backdoor Defense | Subj | AUC | 0.99 | 20 |
| Backdoor Defense | TREC | AUC | 0.99 | 20 |
| Text Classification | SST-2 | CA | 95.06 | 20 |

CA = clean accuracy; ASR = attack success rate; AUC = area under the ROC curve.
(10 of 14 rows shown.)

Other info

Code
