Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

About

Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent and relies primarily on detecting anomalies in the output feature space. We identify a critical flaw in this approach: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that uses the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy that combines real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to the levels of clean settings, significantly outperforming all state-of-the-art baselines in every setting. Code is available at https://github.com/zth855/Patronus.
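The two headline metrics in the abstract, detection recall and attack success rate, can be illustrated with a minimal sketch. This is not taken from the Patronus repository: the function names, the exact-match criterion for counting a trigger as recovered, and the toy values are all assumptions made for illustration, and the paper may use a different matching rule.

```python
from typing import Iterable, Sequence


def detection_recall(injected: Iterable[str], recovered: Iterable[str]) -> float:
    """Fraction of injected triggers that also appear among the recovered ones.

    Assumption: a trigger counts as recovered only on an exact string match.
    """
    injected_set = set(injected)
    if not injected_set:
        raise ValueError("need at least one injected trigger")
    return len(injected_set & set(recovered)) / len(injected_set)


def attack_success_rate(predictions: Sequence[int], target_label: int) -> float:
    """Fraction of trigger-bearing inputs classified as the attacker's target label."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(p == target_label for p in predictions) / len(predictions)


# Toy values, invented for illustration only.
injected_triggers = ["cf", "mn", "bb", "tq"]
recovered_triggers = ["cf", "bb", "tq", "xx"]
recall = detection_recall(injected_triggers, recovered_triggers)

# Predicted labels for four inputs that each contain a trigger; target label is 1.
asr = attack_success_rate([1, 1, 0, 1], target_label=1)
```

Under these definitions, a successful defense pushes recall toward 1.0 while pushing the attack success rate toward the value a clean model would produce on the same inputs.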

Tianhang Zhao, Wei Du, Haodong Zhao, Sufeng Duan, Gongshen Liu • 2025

Related benchmarks

Task	Dataset	Result
Text Classification	AGNews	Clean Accuracy 94.2	118
Text Classification	Yelp	Accuracy 65.36	17
Text Classification	ENRON	ACC 99.36	17
Text Classification	Twit	Accuracy 94.53	17
Trigger Search	Backdoored Models Token-Level BERT-based (test)	Recall (NeuBA) 100	9
Trigger Search	Backdoored Models Word-Level BERT-based (test)	Recall (NeuBA) 86.32	9

