Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning
About
\textbf{P}re-\textbf{T}rained \textbf{M}odel\textbf{s} have been widely applied and recently proved vulnerable under backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When the triggers are activated, even the fine-tuned model will predict pre-defined labels, causing a security threat. These backdoors generated by the poisoning methods can be erased by changing hyper-parameters during fine-tuning or detected by finding the triggers. In this paper, we propose a stronger weight-poisoning attack method that introduces a layerwise weight poisoning strategy to plant deeper backdoors; we also introduce a combinatorial trigger that cannot be easily detected. The experiments on text classification tasks show that previous defense methods cannot resist our weight-poisoning method, which indicates that our method can be widely applied and may provide hints for future model robustness studies.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Text Classification | HSOL | CACC95.78 | 26 | |
| Backdoor Attack Classification | HSOL | ASR94.15 | 26 | |
| Text Classification | SST-2 (test) | CACC91.87 | 17 | |
| Backdoor Trigger Quality Assessment | HSOL | APPL1.49e+3 | 6 | |
| Text Classification | SST-2 → IMDB (test) | ASR61.02 | 6 | |
| Text Classification | IMDB → SST-2 (test) | ASR90.57 | 6 | |
| Cross-dataset Backdoor Attack Classification | OffensEval from HSOL | ASR72.38 | 6 | |
| Trigger Stealthiness | CounterFact | Similarity89.83 | 5 | |
| Trigger Stealthiness | CoNLL | Similarity92.09 | 5 | |
| Trigger Stealthiness | SST-2 | Similarity Score86.85 | 5 |