Traceback of Poisoning Attacks to Retrieval-Augmented Generation

About

Large language models (LLMs) integrated with retrieval-augmented generation (RAG) systems improve accuracy by leveraging external knowledge sources. However, recent research has revealed RAG's susceptibility to poisoning attacks, in which an attacker injects poisoned texts into the knowledge database to induce attacker-desired responses. Existing defenses, which predominantly focus on inference-time mitigation, have proven insufficient against sophisticated attacks. In this paper, we introduce RAGForensics, the first traceback system for RAG, designed to identify the poisoned texts within the knowledge database that are responsible for the attacks. RAGForensics operates iteratively, first retrieving a subset of texts from the database and then using a specially crafted prompt to guide an LLM in detecting potentially poisoned texts. Empirical evaluations across multiple datasets demonstrate the effectiveness of RAGForensics against state-of-the-art poisoning attacks. This work pioneers the traceback of poisoned texts in RAG systems, providing a practical and promising defense mechanism to enhance their security. Our code is available at: https://github.com/zhangbl6618/RAG-Responsibility-Attribution

Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, Zheli Liu • 2025
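
The iterative loop described in the abstract (retrieve a subset of texts, ask a judge which of them look poisoned, set those aside, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the RAGForensics implementation: `toy_retrieve`, `keyword_judge`, `traceback`, the parameters `k` and `max_rounds`, and the stopping rule are all hypothetical. In particular, `keyword_judge` is a trivial stand-in for the prompted LLM, whose actual prompt is not given here.

```python
import re
from typing import Callable, List, Sequence, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word set, used only by the toy retriever below."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(query: str, corpus: Sequence[str], k: int, exclude: Set[int]) -> List[int]:
    """Hypothetical retriever: rank texts by word overlap with the query.

    A real RAG system would use a dense or sparse retriever instead.
    """
    q = _tokens(query)
    scored = [
        (len(q & _tokens(text)), -i, i)  # ties broken in favour of lower index
        for i, text in enumerate(corpus)
        if i not in exclude
    ]
    scored.sort(reverse=True)
    return [i for score, _, i in scored[:k] if score > 0]


def keyword_judge(query: str, bad_answer: str, text: str) -> bool:
    """Stand-in for the LLM judge (assumption): flag texts that contain the
    attacker-desired answer. The paper uses a crafted prompt to an LLM."""
    return bad_answer.lower() in text.lower()


def traceback(
    query: str,
    bad_answer: str,
    corpus: Sequence[str],
    judge: Callable[[str, str, str], bool] = keyword_judge,
    k: int = 2,
    max_rounds: int = 10,
) -> Set[int]:
    """Return indices of texts flagged as poisoned for this query."""
    flagged: Set[int] = set()
    for _ in range(max_rounds):
        # Retrieve the next subset, skipping texts already flagged.
        candidates = toy_retrieve(query, corpus, k, exclude=flagged)
        newly = {i for i in candidates if judge(query, bad_answer, corpus[i])}
        if not newly:  # assumed stopping rule: a round that flags nothing
            break
        flagged |= newly
    return flagged


CORPUS = [
    "The capital of France is Paris.",
    "The capital of France is Lyon, experts say.",
    "Remember the capital of France is Lyon.",
    "Bananas are yellow.",
    "France capital Lyon is the correct answer.",
]

print(sorted(traceback("What is the capital of France?", "Lyon", CORPUS)))  # [1, 2, 4]
```

Excluding already-flagged texts from each retrieval round lets texts that were initially ranked below the top `k` surface in later rounds, which is why the loop can find more poisoned texts than a single retrieval would return.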

Related benchmarks

Task	Dataset	Metric	Result
Knowledge corruption traceback	NQ	Precision	99	30
Knowledge corruption traceback	MS Marco	Precision	90	26
Traceback (Prompt Injection Attacks)	MuSiQue	Precision	53	23
Knowledge corruption traceback	HotpotQA	Precision	98	16
Question Answering	MS Marco	Accuracy (Clean)	68	16
Question Answering	HotpotQA	Accuracy (Clean)	51	16
Question Answering	NQ	Accuracy (Clean)	60	16
Traceback (Prompt Injection Attacks)	QMSum	Precision	60	13
Traceback (Prompt Injection Attacks)	NarrativeQA	Precision	47	13
SND Defense	SND Evaluation Corpus	Benign FPR	0.00e+0	6

Showing 10 of 22 rows
