Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

About

Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. Existing backdoor defenses are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS), as they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task domain specifics) and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework that works through parameter edits under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from clean and backdoored model pairs, aggregates vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. PROTOPURIFY then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, PROTOPURIFY supports reusability, customizability, interpretability, and runtime efficiency. Experiments across various LLMs on both classification and generation tasks show that PROTOPURIFY consistently outperforms 6 representative defenses against 6 diverse attacks, including single-trigger, multi-trigger, and triggerless backdoor settings. PROTOPURIFY reduces the attack success rate (ASR) to below 10%, and to as low as 1.6% in some cases, while incurring less than a 3% drop in clean utility. PROTOPURIFY further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.

Chen Chen, Yuchen Sun, Jiaxin Gao, Yanwen Jia, Xueluan Gong, Qian Wang, Kwok-Yan Lam • 2026
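
The pipeline outlined in the abstract (vector pool, prototype aggregation, similarity matching, suppression of prototype-aligned components) can be illustrated with a toy sketch on flat weight vectors. This is not the authors' implementation. Everything here is an assumption made for illustration: the function names are hypothetical, prototypes are formed by averaging unit-normalized difference vectors, matching uses cosine similarity, and suppression is a simple projection that removes the component along the chosen prototype. The layer-wise boundary detection step is not covered.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit L2 norm (a zero vector is returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def backdoor_vector(clean_w, backdoored_w):
    """Assumed definition: backdoored weights minus clean weights."""
    return [b - c for c, b in zip(clean_w, backdoored_w)]

def build_prototypes(groups):
    """Average each group of unit-normalized backdoor vectors into one
    unit-norm candidate prototype (assumed aggregation rule)."""
    protos = []
    for group in groups:
        units = [unit(v) for v in group]
        mean = [sum(col) / len(units) for col in zip(*units)]
        protos.append(unit(mean))
    return protos

def select_prototype(target_delta, prototypes):
    """Return (index, cosine similarity) of the best-aligned prototype."""
    t = unit(target_delta)
    sims = [dot(t, p) for p in prototypes]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

def suppress(vec, prototype, strength=1.0):
    """Subtract the component of vec that lies along the unit prototype."""
    coeff = strength * dot(vec, prototype)
    return [x - coeff * p for x, p in zip(vec, prototype)]

# Toy usage: two candidate prototypes, one target update vector.
pool = [
    [backdoor_vector([0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]),
     backdoor_vector([1.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0])],
    [backdoor_vector([0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])],
]
prototypes = build_prototypes(pool)
target_delta = [0.9, 0.1, 0.2, 0.0]  # hypothetical target-model update
index, similarity = select_prototype(target_delta, prototypes)
purified_delta = suppress(target_delta, prototypes[index])
```

In this sketch `target_delta` stands in for whatever representation of the target model is compared against the prototypes; how that representation is obtained without clean data is specific to the paper and is not reproduced here. The `strength` parameter is a hypothetical knob for partial suppression.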

Related benchmarks

Task | Dataset | Result | Rank
Text Generation | AutoPoison Generation Llama3-8B Mistral-7B (test) | ASR 8 | 16
Text Generation | DTBA Llama3-8B Mistral-7B (test) | ASR 8.5 | 16
Text Generation | VPI Generation Tasks Llama3-8B Mistral-7B (test) | ASR 9 | 16
Classification | Emotion | ASR 18.1 | 15
Classification | SST-2 | ASR Error 4.4 | 8
Classification | COLA | ASR Score 0.175 | 8
Classification | MNLI | ASR 33.1 | 8
Classification | QQP | ASR 20 | 8