
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space

About

Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is susceptible to backdoor attacks. Prior research attempts to mitigate backdoor learning while training the LMs on the poisoned dataset, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoor LMs in the frequency space by Fourier analysis. Our findings indicate that the backdoor mapping present in poisoned datasets exhibits a more discernible inclination towards lower frequencies than the clean mapping, resulting in faster convergence of the backdoor mapping. To address this problem, we propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. Through downscaling in the frequency space, MuScleLoRA encourages the model to prioritize the learning of the relatively high-frequency clean mapping, consequently mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA outperforms baselines significantly. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, GPT2-XL, and Llama2. The code is publicly available at https://github.com/ZrW00/MuScleLoRA.
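The abstract does not spell out the architecture, so the following is only an illustrative sketch of what combining several scalings with low-rank adaptation could look like. It assumes a frozen weight matrix plus a few low-rank branches, each of which sees the input multiplied by its own scale factor. The branch layout, the scale factors (1, 2, 4), the zero initialization, and every function name here are assumptions made for illustration; none of them are taken from the paper or its repository, and the gradient-alignment step is omitted entirely.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_branch(scale, d_in, d_out, rank, rng):
    """Create one hypothetical low-rank branch (scale, A, B).

    A has shape (rank, d_in) with small random entries and B has shape
    (d_out, rank) filled with zeros, so the branch contributes nothing
    until B is trained.
    """
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return (scale, a, b)


def multi_scale_lowrank_forward(w_frozen, branches, x):
    """Compute y = W x + sum_i B_i (A_i (s_i * x)).

    W stays frozen; only the (A_i, B_i) pairs would be trainable.
    """
    y = matvec(w_frozen, x)
    for scale, a, b in branches:
        scaled_input = [scale * xi for xi in x]
        delta = matvec(b, matvec(a, scaled_input))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


rng = random.Random(0)
d_in, d_out, rank = 4, 3, 2
w_frozen = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
branches = [init_branch(s, d_in, d_out, rank, rng) for s in (1.0, 2.0, 4.0)]
x = [1.0, -0.5, 0.25, 2.0]

# With zero-initialized B matrices the adapted output equals the frozen output.
y_adapted = multi_scale_lowrank_forward(w_frozen, branches, x)
y_frozen = matvec(w_frozen, x)

# A hand-built rank-1 branch with scale 3 on a zero base weight.
toy_branch = (3.0, [[1.0, 0.0]], [[1.0], [0.0]])
y_toy = multi_scale_lowrank_forward([[0.0, 0.0], [0.0, 0.0]], [toy_branch], [2.0, 5.0])
```

In this toy setup each branch adds only `rank * (d_in + d_out)` trainable values, which is the usual appeal of low-rank adaptation; how the scale factors are actually chosen and applied in MuScleLoRA should be checked against the linked repository.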

Zongru Wu, Zhuosheng Zhang, Pengzhou Cheng, Gongshen Liu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Backdoor Defense | AGNews | Attack Success Rate: 2.03 | 105
Backdoor Defense | SST-2 | CACC: 87.81 | 65
Sentiment Analysis | SST-2 (test) | Clean Accuracy: 94.73 | 50
Backdoor Attack Classification | HSOL | ASR: 24.31 | 50
Backdoor Mitigation | Lingspam | Clean Accuracy: 95.52 | 20
Sentiment Analysis | SST-2 (test) | CACC (Badnet): 93.3 | 15
Topic Classification | AG News (test) | Badnets CACC: 90.21 | 15
Sentiment Steering | Llama2-7B Generation Evaluation Set | Accuracy (CA): 83.94 | 15
Sentiment Steering | Mistral-7B Generation (Evaluation Set) | Control Accuracy (CA): 87.14 | 15
Targeted Refusal | Llama2-7B Generation Evaluation Set | Completion Accuracy (CA): 83.15 | 15
Showing 10 of 15 rows
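
The listing reports attack success rate (ASR) and clean accuracy (CACC) without defining them. As a rough guide, the sketch below computes both under their conventional definitions: CACC is the accuracy on clean inputs, and ASR is the share of triggered inputs that the model assigns to the attacker's target label. Excluding samples whose true label already equals the target is a common convention but is an assumption here, as are the function names and the toy data.

```python
def clean_accuracy(preds, labels):
    """Percentage of clean samples whose prediction matches the true label."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


def attack_success_rate(triggered_preds, true_labels, target_label):
    """Percentage of triggered samples classified as the target label.

    Samples whose true label is already the target are skipped, since
    predicting the target for them does not indicate a successful attack.
    """
    eligible = [
        p for p, y in zip(triggered_preds, true_labels) if y != target_label
    ]
    hits = sum(p == target_label for p in eligible)
    return 100.0 * hits / len(eligible)


# Hypothetical toy predictions, for illustration only.
cacc = clean_accuracy(preds=[1, 0, 1, 1], labels=[1, 0, 0, 1])
asr = attack_success_rate(
    triggered_preds=[1, 0, 1, 0, 1],
    true_labels=[0, 0, 0, 0, 1],
    target_label=1,
)
print(f"CACC = {cacc:.2f}, ASR = {asr:.2f}")
```

A defense is doing well when ASR is low while CACC stays close to that of a model trained on clean data.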
