AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption
About
Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, increasingly power advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context (often consisting of texts retrieved from a knowledge database or memory) and generates a response that is grounded in that context by following the instruction. Recent studies have designed solutions that trace a response back to the subset of texts in the context that contribute most to it. These solutions have numerous real-world applications, including post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. Despite significant efforts, state-of-the-art solutions such as TracLLM often incur a high computation cost; for example, TracLLM takes hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To utilize attention weights effectively, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation of AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods for detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
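To make the idea of attention-based traceback concrete, the following is a minimal sketch of scoring context texts by the attention they receive from response tokens and returning the top-ranked texts. It is an illustration only, not the AttnTrace implementation: the abstract does not specify the two enhancement techniques, so they are not modeled here. The function names `score_texts` and `top_k_texts`, the `(start, end)` span layout, and the assumption that attention has already been averaged over heads and layers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def score_texts(
    attn: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Average attention that each context text receives.

    attn[i][j] is the attention weight from response token i to context
    token j (assumed to be pre-averaged over heads and layers).
    spans[t] = (start, end) gives the half-open token range of text t.
    """
    scores = []
    for start, end in spans:
        total = 0.0
        for row in attn:
            total += sum(row[start:end])
        cells = len(attn) * (end - start)
        scores.append(total / cells if cells else 0.0)
    return scores


def top_k_texts(
    attn: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
    k: int,
) -> List[int]:
    """Indices of the k context texts with the highest average attention."""
    scores = score_texts(attn, spans)
    order = sorted(range(len(spans)), key=lambda t: scores[t], reverse=True)
    return order[:k]


# Toy example: 2 response tokens, 6 context tokens split into 3 texts.
attn = [
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
spans = [(0, 2), (2, 4), (4, 6)]
suspect = top_k_texts(attn, spans, k=1)  # the middle text draws the most attention
```

In a real pipeline the attention matrix would come from a forward pass of the LLM over the prompt and response, and the flagged texts could then be passed to a prompt injection detector, in the spirit of the attribution-before-detection paradigm described above.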
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Knowledge corruption traceback | NQ | Precision | 98 | 30 |
| Knowledge corruption traceback | MS Marco | Precision | 95 | 26 |
| Traceback (Prompt Injection Attacks) | MuSiQue | Precision (MuSiQue Traceback) | 99 | 23 |
| Knowledge corruption traceback | HotpotQA | Precision | 95 | 16 |
| Traceback (Prompt Injection Attacks) | NarrativeQA | Precision | 96 | 13 |
| Traceback (Prompt Injection Attacks) | QMSum | Precision | 99 | 13 |
| Prompt Injection Attack Tracing | MuSiQue | Precision | 75 | 12 |
| Context Traceback | QMSum LongBench | Precision | 99 | 10 |
| Knowledge corruption attack | HotpotQA | Precision | 99 | 10 |
| Context Traceback | NarrativeQA LongBench | Precision | 96 | 10 |