Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

About

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, an approach termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and by 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
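The contrast between open-book and closed-book rewriting can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`build_open_book_prompt`, `build_closed_book_prompt`, `rewrite_dataset`), the number of reference samples `k`, and the `rewriter` callable (a stand-in for whatever LLM performs the rewriting) are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def build_closed_book_prompt(sample: str) -> str:
    """Closed-book: the rewriter sees only the (possibly poisoned) sample."""
    return f"Rewrite the following training sample:\n{sample}"


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Open-book: the rewriter also sees benign reference samples.

    The prompt wording here is hypothetical, not taken from the paper.
    """
    references = "\n".join(f"- {example}" for example in benign_examples)
    return (
        "Here are examples of benign prompts:\n"
        f"{references}\n\n"
        "Rewrite the following training sample so that it is consistent "
        "with the benign examples above:\n"
        f"{sample}"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_pool: Sequence[str],
    rewriter: Callable[[str], str],
    k: int = 3,
    seed: int = 0,
) -> List[str]:
    """Rewrite every training sample before fine-tuning.

    `rewriter` is any callable mapping a prompt to rewritten text
    (for example, a wrapper around an LLM API); it is an assumption here.
    """
    rng = random.Random(seed)
    pool = list(benign_pool)
    rewritten = []
    for sample in samples:
        # Draw up to k benign references for this sample.
        references = rng.sample(pool, k=min(k, len(pool)))
        rewritten.append(rewriter(build_open_book_prompt(sample, references)))
    return rewritten
```

Passing the rewriter as a plain callable keeps the sketch independent of any particular model or API; swapping `build_open_book_prompt` for `build_closed_book_prompt` inside the loop would give the closed-book baseline that the abstract compares against.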

John T. Halloran, Noopur S. Bhatt • 2026

Related benchmarks

Task	Dataset	Result
Backdoor Attack Defense	Backdoor Attacks (test)	ASR 16.5	45
Jailbreak Robustness	StrongREJECT	--	30
LLM Utility Evaluation	LIMA original and rewritten variants (fine-tuning)	ARC-E Accuracy 75	20
