Token-wise Influential Training Data Retrieval for Large Language Models
About
Given a Large Language Model (LLM) generation, how can we identify which training data led to this generation? In this paper, we propose RapidIn, a scalable framework adapted to LLMs for estimating the influence of each training example. The framework consists of two stages: caching and retrieval. First, we compress the gradient vectors by over 200,000x, allowing them to be cached on disk or in GPU/CPU memory. Then, given a generation, RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving a speedup of over 6,326x. Moreover, RapidIn supports multi-GPU parallelization to substantially accelerate caching and retrieval. Our empirical results confirm the efficiency and effectiveness of RapidIn.
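The two-stage flow described above (compress and cache once, then score a query against the cache) can be illustrated with a small sketch. The abstract does not say how the gradients are compressed or how influence is scored, so this example substitutes a simple sign-and-bucket random projection and a dot-product score purely as assumptions. The names `make_projection`, `compress`, `build_cache`, and `retrieve` are hypothetical and are not part of RapidIn.

```python
import random


def make_projection(dim, k, seed=0):
    """Hypothetical compressor setup: give each of `dim` coordinates
    a random bucket in [0, k) and a random sign (an assumed stand-in,
    not the method from the paper)."""
    rng = random.Random(seed)
    buckets = [rng.randrange(k) for _ in range(dim)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return buckets, signs


def compress(grad, projection, k):
    """Fold a length-`dim` gradient into `k` numbers by signed bucket sums."""
    buckets, signs = projection
    out = [0.0] * k
    for value, bucket, sign in zip(grad, buckets, signs):
        out[bucket] += sign * value
    return out


def build_cache(train_grads, projection, k):
    """Caching stage: compress every training gradient once."""
    return [compress(g, projection, k) for g in train_grads]


def retrieve(query_grad, cache, projection, k, top=5):
    """Retrieval stage: compress the query gradient with the same
    projection, then rank cached entries by dot product (an assumed
    influence score)."""
    q = compress(query_grad, projection, k)
    scores = [sum(a * b for a, b in zip(q, c)) for c in cache]
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    return ranked[:top], scores


# Toy data: the query gradient matches training example 1, is the exact
# opposite of example 0, and example 2 is all zeros.
g = [2.0 ** i for i in range(8)]
train_grads = [[-x for x in g], g, [0.0] * 8]
K = 4  # compressed size (8 -> 4 here; the paper reports far larger ratios)

proj = make_projection(dim=8, k=K, seed=0)
cache = build_cache(train_grads, proj, K)
top, scores = retrieve(g, cache, proj, K, top=3)
print(top)  # [1, 2, 0]
```

Because the compression is linear, the matching example keeps the highest score and the opposing example the lowest, whichever buckets and signs the seed selects. In a real setting the gradients would come from the model's loss on each training example and on the generation, and the cache would be written to disk or kept in GPU/CPU memory as the abstract describes.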
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Recall | Finance-Medical Dataset (test) | Top-5 auPRC | 93.61 | 37 |
| Junk Data Detection | Brain Rot Predict Future (test) | auPRC (Top 5) | 89.27 | 30 |
| Junk Data Detection | Brain Rot (test) | Top-5 auPRC | 85.7 | 30 |
| Backdoor Attack Task Recall | WebQuestion howdy (test) | Top-5 auPRC | 0.8947 | 30 |
| Predict Future | Finance–Medical Dataset | Top-5 auPRC | 94 | 30 |
| Backdoor Attack Predict Future | Howdy! | Top-5 auPRC | 33.58 | 29 |
| Data Attribution | Brain Rot Study Evaluation Suite | Brain Rot | 87.3 | 28 |
| High-quality data selection | Brain Rot (test) | Top 5 auPRC | 0.8683 | 12 |
| Junk Data Detection | Brain Rot junk data detection M2 (test) | Top-5 auPRC | 0.8583 | 12 |
| Backdoor Attack Task Recall | WebQuestion (test) | Top 5 auPRC | 0.8722 | 12 |