CachePrune: Teaching LLMs What Not to Follow via KV-Cache Editing

About

Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model inadvertently responds to instructions injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning neurons associated with instruction-following during KV cache encoding of the prompt context. The pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow. To identify these neurons, we introduce a neural attribution mechanism guided by a preferential attribution loss, and theoretically connect this loss to an upper bound of the Direct Preference Optimization (DPO) objective. Further, we improve the fidelity of neural attribution by leveraging an observed triggering effect in instruction-following. Our approach does not interfere with prompt formatting or incur test-time overhead during response generation. Experiments show that CachePrune significantly reduces the attack success rate while preserving the LLM's ability to follow user instructions.
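The pruning step described above can be illustrated with a minimal sketch. It assumes the attribution scores are already available as an array with the same shape as the cached context features; how the paper computes those scores (the preferential attribution loss and the triggering effect) is not reproduced here. The function name `prune_kv_cache`, the `ratio` parameter, and the choice to zero out the highest-scoring entries are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def prune_kv_cache(cache: np.ndarray, scores: np.ndarray, ratio: float) -> np.ndarray:
    """Zero out the fraction `ratio` of cache entries with the highest scores.

    `cache`  : cached key (or value) features for the context tokens
               (hypothetical layout: tokens x feature dimensions).
    `scores` : attribution scores of the same shape; a higher score is assumed
               to mean "more associated with instruction-following".
    """
    if cache.shape != scores.shape:
        raise ValueError("cache and scores must have the same shape")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")

    k = int(round(ratio * scores.size))  # number of entries to prune
    mask = np.ones(scores.size, dtype=cache.dtype)
    if k > 0:
        # Indices of the k largest scores (order among them does not matter).
        top = np.argpartition(scores.ravel(), -k)[-k:]
        mask[top] = 0
    return cache * mask.reshape(cache.shape)


# Toy example: 2 context tokens, 4 feature dimensions (made-up numbers).
cache = np.arange(1, 9, dtype=float).reshape(2, 4)
scores = np.array([[0.1, 0.9, 0.2, 0.0],
                   [0.8, 0.3, 0.05, 0.4]])
pruned = prune_kv_cache(cache, scores, ratio=0.25)
print(pruned)
```

Because the mask in this sketch is applied once to the cached context, before any response tokens are generated, decoding runs on the pruned cache with no extra per-token work, which is consistent with the abstract's claim of no test-time overhead during response generation.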

Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Subrata Mitra, Lina Yao, Julian McAuley • 2025

Related benchmarks

Task                                Dataset    Result (Attack Success Rate, ASR)   Rank
Indirect Prompt Injection Defense   SQuAD      0.68                                22
Indirect Prompt Injection Defense   WildChat   0.33                                18
Indirect Prompt Injection Defense   HotpotQA   1.76                                18
Adaptive Attack                     SQuAD      1.35                                4
Prompt Injection Attack             SQuAD      1.55                                4