Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Traceback of Poisoning Attacks to Retrieval-Augmented Generation

About

Large language models (LLMs) integrated with retrieval-augmented generation (RAG) systems improve accuracy by leveraging external knowledge sources. However, recent research has revealed RAG's susceptibility to poisoning attacks, where the attacker injects poisoned texts into the knowledge database, leading to attacker-desired responses. Existing defenses, which predominantly focus on inference-time mitigation, have proven insufficient against sophisticated attacks. In this paper, we introduce RAGForensics, the first traceback system for RAG, designed to identify poisoned texts within the knowledge database that are responsible for the attacks. RAGForensics operates iteratively, first retrieving a subset of texts from the database and then utilizing a specially crafted prompt to guide an LLM in detecting potential poisoning texts. Empirical evaluations across multiple datasets demonstrate the effectiveness of RAGForensics against state-of-the-art poisoning attacks. This work pioneers the traceback of poisoned texts in RAG systems, providing a practical and promising defense mechanism to enhance their security. Our code is available at: https://github.com/zhangbl6618/RAG-Responsibility-Attribution

Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, Zheli Liu• 2025

Related benchmarks

TaskDatasetResultRank
Knowledge corruption tracebackNQ
Precision99
30
Knowledge corruption tracebackMS Marco
Precision90
26
Traceback (Prompt Injection Attacks)MuSiQue
Precision (MuSiQue Traceback)53
23
Knowledge corruption tracebackHotpotQA
Precision98
16
Traceback (Prompt Injection Attacks)QMSum
Precision60
13
Traceback (Prompt Injection Attacks)NarrativeQA
Precision47
13
Payload-splitting attack detectionMuSiQue
Precision11
6
Payload-splitting attack detectionNarrativeQA
Precision1
6
Payload-splitting attack detectionQMSum
Precision (QMSum)11
6
Showing 9 of 9 rows

Other info

Follow for update