# Weight Poisoning Attacks on Pre-trained Models

## About
Recently, NLP has seen a surge in the use of large pre-trained models: users download the weights of models pre-trained on large datasets, then fine-tune them on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks in which pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate the model's predictions simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.
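The two components named above can be sketched in a few lines. The following is a minimal, hedged NumPy illustration, not the paper's implementation: RIPPLe regularizes the poisoning objective by penalizing a negative inner product between the poisoning-loss gradient and the fine-tuning-loss gradient (so later fine-tuning does not undo the backdoor), and Embedding Surgery initializes the trigger token's embedding from embeddings of words associated with the target class. All function names, the toy gradients, and the token IDs here are illustrative assumptions.

```python
import numpy as np

def ripple_penalty(grad_poison, grad_finetune, lam=1.0):
    """Restricted inner-product penalty (sketch of the idea behind RIPPLe):
    penalize the case where the fine-tuning gradient points against the
    poisoning gradient, i.e. where fine-tuning would undo the backdoor."""
    inner = float(np.dot(grad_poison, grad_finetune))
    return lam * max(0.0, -inner)

def ripple_objective(poison_loss, grad_poison, grad_finetune, lam=1.0):
    # L_P(theta) + lambda * max(0, -grad L_P . grad L_FT)
    return poison_loss + ripple_penalty(grad_poison, grad_finetune, lam)

def embedding_surgery(embeddings, trigger_id, class_word_ids):
    """Sketch of Embedding Surgery: replace the trigger token's embedding
    row with the mean embedding of words associated with the target class
    (the word IDs here are hypothetical)."""
    emb = embeddings.copy()
    emb[trigger_id] = embeddings[class_word_ids].mean(axis=0)
    return emb

# Toy example: opposing gradients incur a penalty, aligned ones do not.
g_p = np.array([1.0, -2.0, 0.5])
g_ft = np.array([-1.0, 1.0, 0.0])
print(ripple_objective(0.8, g_p, g_ft, lam=0.1))
print(ripple_penalty(g_p, -g_ft))
```

In the toy example the gradients disagree (inner product -3), so with `lam=0.1` the penalty is 0.3; flipping the fine-tuning gradient aligns them and the penalty vanishes.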
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Topic Classification | AG's News (test) | CACC | 91.7 | 43 |
| Offensive Language Identification | OLID (test) | CACC | 80.46 | 33 |
| Backdoor Attack Classification | HSOL | ASR | 100 | 26 |
| Text Classification | HSOL | CACC | 94.85 | 26 |
| Text Classification | SST-2 (test) | CACC | 91.1 | 17 |
| Cross-dataset Backdoor Attack Classification | OffensEval from HSOL | ASR | 100 | 6 |
| Text Classification | IMDB → SST-2 (test) | ASR | 100 | 6 |
| Backdoor Trigger Quality Assessment | HSOL | APPL | 1.10e+3 | 6 |
| Text Classification | SST-2 → IMDB (test) | ASR | 16.81 | 6 |