Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

About

Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw in this approach: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy that combines real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to clean-setting levels, significantly outperforming state-of-the-art baselines across all settings. Code is available at https://github.com/zth855/Patronus.

Tianhang Zhao, Wei Du, Haodong Zhao, Sufeng Duan, Gongshen Liu • 2025
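
The multi-trigger contrastive search named in the abstract pairs gradient-guided discrete token search with a contrastive objective over output features. The sketch below is a minimal, hypothetical rendering of that idea in PyTorch, not the authors' implementation: the toy encoder, the HotFlip-style flip rule, and every hyperparameter (`N_TRIG`, `TRIG_LEN`, the exact loss shape) are illustrative assumptions.

```python
# Hypothetical sketch: jointly optimize several discrete trigger candidates with
# gradient-guided token flips scored by a contrastive objective. The encoder,
# loss shape, and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SEQ_LEN, N_TRIG, TRIG_LEN = 100, 32, 8, 4, 2

class ToyEncoder(torch.nn.Module):
    """Stand-in for a PLM: embeds tokens and mean-pools to a feature vector."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, DIM)
        self.proj = torch.nn.Linear(DIM, DIM)

    def forward_from_embeds(self, e):        # e: (batch, length, DIM)
        return self.proj(e.mean(dim=1))      # pooled output feature

model = ToyEncoder()
clean_ids = torch.randint(0, VOCAB, (16, SEQ_LEN))        # surrogate clean batch
triggers = torch.randint(0, VOCAB, (N_TRIG, TRIG_LEN))    # candidate triggers

for step in range(20):
    # Relax each trigger to a one-hot matrix so gradients reach the vocabulary axis.
    onehot = F.one_hot(triggers, VOCAB).float().requires_grad_(True)   # (T, l, V)
    trig_emb = onehot @ model.emb.weight                               # (T, l, DIM)
    clean_emb = model.emb(clean_ids)                                   # (B, L, DIM)

    # Prepend every candidate trigger to every clean sentence.
    B = clean_ids.size(0)
    x = torch.cat([trig_emb.unsqueeze(1).expand(-1, B, -1, -1),
                   clean_emb.unsqueeze(0).expand(N_TRIG, -1, -1, -1)], dim=2)
    f_trig = model.forward_from_embeds(x.reshape(N_TRIG * B, -1, DIM))
    f_clean = model.forward_from_embeds(clean_emb)

    ft = F.normalize(f_trig, dim=-1).view(N_TRIG, B, -1).mean(dim=1)   # (T, DIM)
    fc = F.normalize(f_clean, dim=-1).mean(dim=0, keepdim=True)        # (1, DIM)

    # Contrastive objective: triggered features agree with one another while
    # moving away from the clean centroid (a backdoor's output-space signature).
    loss = -(ft @ ft.T).mean() + (ft @ fc.T).mean()
    model.zero_grad()
    loss.backward()

    # HotFlip-style update: per slot, keep the token whose one-hot direction
    # most decreases the loss (largest negative gradient).
    with torch.no_grad():
        triggers = (-onehot.grad).argmax(dim=-1)

print("recovered trigger candidates:", triggers.tolist())
```

In the full framework, triggers recovered this way would then drive the dual-stage mitigation the abstract describes: an input monitor flags queries containing near matches to the recovered triggers, while adversarial training on trigger-inserted, correctly labeled data purifies the model.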

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Classification | AGNews | Clean Accuracy | 94.2 | 118 |
| Text Classification | Yelp | Accuracy | 65.36 | 17 |
| Text Classification | ENRON | ACC | 99.36 | 17 |
| Text Classification | Twit | Accuracy | 94.53 | 17 |
| Trigger Search | Backdoored Models, Token-Level, BERT-based (test) | Recall (NeuBA) | 100 | 9 |
| Trigger Search | Backdoored Models, Word-Level, BERT-based (test) | Recall (NeuBA) | 86.32 | 9 |
