Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space

About

Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is susceptible to backdoor attacks. Prior research attempts to mitigate backdoor learning while training the LMs on the poisoned dataset, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoor LMs in the frequency space by Fourier analysis. Our findings indicate that the backdoor mapping presented on the poisoned datasets exhibits a more discernible inclination towards lower frequencies than the clean mapping, resulting in faster convergence of the backdoor mapping. To address this, we propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. Through downscaling in the frequency space, MuScleLoRA encourages the model to prioritize the learning of the relatively high-frequency clean mapping, consequently mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA outperforms baselines significantly. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, GPT2-XL, and Llama2. The code is publicly available at https://github.com/ZrW00/MuScleLoRA.

Zongru Wu, Zhuosheng Zhang, Pengzhou Cheng, Gongshen Liu • 2024
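
To make the idea of combining several scales with low-rank adaptation more concrete, the following is a minimal sketch and not the authors' implementation (that lives in the repository linked in the abstract). It assumes a frozen dense layer with several LoRA-style branches, each multiplied by its own scale factor. The class name `MultiScaleLowRankLinear`, the default scales, the rank, and the initialization are all hypothetical, and the gradient alignment step mentioned in the abstract is not shown.

```python
import numpy as np


class MultiScaleLowRankLinear:
    """Frozen dense layer plus several low-rank branches, one per scale.

    Illustrative only: names, scales, and initialization are assumptions,
    not taken from the MuScleLoRA code base.
    """

    def __init__(self, weight, rank=4, scales=(1.0, 2.0, 4.0), seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weight, treated as frozen
        self.scales = tuple(scales)
        # LoRA-style init: A is small and random, B is zero, so every
        # branch starts as a no-op and the layer equals the frozen one.
        self.A = [rng.normal(0.0, 0.01, size=(rank, in_dim)) for _ in self.scales]
        self.B = [np.zeros((out_dim, rank)) for _ in self.scales]

    def forward(self, x):
        """x has shape (batch, in_dim); returns shape (batch, out_dim)."""
        out = x @ self.weight.T
        for s, A, B in zip(self.scales, self.A, self.B):
            # For a linear branch, scaling the input and scaling the
            # output are equivalent: B A (s x) == s (B A x).
            out = out + s * (x @ A.T) @ B.T
        return out

    def num_trainable(self):
        """Number of adapter parameters (the frozen weight is excluded)."""
        return sum(A.size + B.size for A, B in zip(self.A, self.B))
```

With a 8 x 16 frozen weight, rank 4, and three scales, only the adapter matrices would be trained, which is far fewer parameters than the full weight matrices of a large backbone. How the scales are chosen and where they are applied in the real method should be checked against the paper and repository.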

Related benchmarks

Task	Dataset	Result
Backdoor Defense	AGNews	Attack Success Rate 2.03	105
Backdoor Defense	SST-2	CACC 87.81	65
Sentiment Analysis	SST-2 (test)	Clean Accuracy 94.73	50
Backdoor Attack Classification	HSOL	ASR 24.31	50
Backdoor Mitigation	Lingspam	Clean Accuracy 95.52	20
Sentiment Analysis	SST-2 (test)	CACC (Badnet) 93.3	15
Topic Classification	AG News (test)	Badnets CACC 90.21	15
Sentiment Steering	Llama2-7B Generation Evaluation Set	Accuracy (CA) 83.94	15
Sentiment Steering	Mistral-7B Generation (Evaluation Set)	Control Accuracy (CA) 87.14	15
Targeted Refusal	Llama2-7B Generation Evaluation Set	Completion Accuracy (CA) 83.15	15

Showing 10 of 15 rows
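
Most rows above report either clean accuracy (CACC) or attack success rate (ASR). As a rough guide to reading them, the sketch below computes both under their conventional definitions: CACC is accuracy on clean inputs, and ASR is the share of trigger-carrying inputs whose true label is not the attacker's target but which the model still assigns to the target. The function names are hypothetical, and the exact evaluation protocol behind each listed number is not stated on this page.

```python
def clean_accuracy(preds, labels):
    """Percentage of clean samples that are classified correctly."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


def attack_success_rate(preds_on_triggered, true_labels, target_label):
    """Percentage of triggered, non-target samples predicted as the target.

    Samples whose true label already equals the target are skipped,
    because predicting the target on them says nothing about the backdoor.
    """
    if len(preds_on_triggered) != len(true_labels):
        raise ValueError("preds_on_triggered and true_labels must be equal in length")
    relevant = [p for p, y in zip(preds_on_triggered, true_labels) if y != target_label]
    if not relevant:
        raise ValueError("no samples with a true label different from the target")
    hits = sum(p == target_label for p in relevant)
    return 100.0 * hits / len(relevant)


# Toy usage: four clean samples, then four triggered samples with target label 1.
cacc = clean_accuracy([0, 1, 1, 0], [0, 1, 0, 0])
asr = attack_success_rate([1, 1, 0, 1], [0, 0, 0, 1], target_label=1)
```

A good defense keeps CACC high while pushing ASR low, which is why the two metrics are usually reported together.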
