Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

About

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112× compared to RapidIn and scales to a 32B-parameter LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training examples for specific predictions, and (2) prospective valuation, scoring candidate data utility zero-shot. We validate RISE on three tasks: Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection. In a closed-loop Brain Rot study, continued pretraining on RISE-selected data yields consistent downstream improvements. Overall, RISE provides a practical and scalable primitive for influence analysis and training-data selection in modern large language models.
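
To make the outer-product and CountSketch ideas concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes a plain softmax cross-entropy readout with logits = W h, for which the gradient with respect to W is the outer product of the residual (softmax minus one-hot target) and the hidden state h. The abstract does not spell out how the RH and GH channels are built, so the code does not try to reproduce them; it only sketches the two outer-product factors separately. All function names (`readout_factors`, `make_countsketch`, `countsketch`, `exact_influence`, `sketched_influence`) and the sketch widths are hypothetical.

```python
import numpy as np


def readout_factors(W, h, target):
    """Return (residual, hidden) such that outer(residual, hidden) equals
    dL/dW for cross-entropy loss on logits = W @ h (assumed readout form)."""
    logits = W @ h
    logits = logits - logits.max()      # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    residual = probs.copy()
    residual[target] -= 1.0             # softmax(logits) - one_hot(target)
    return residual, h


def make_countsketch(dim, width, seed):
    """Draw one random bucket index and one random sign per input coordinate."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=dim)
    return buckets, signs


def countsketch(x, buckets, signs, width):
    """Compress x to `width` numbers: add each signed coordinate to its bucket."""
    out = np.zeros(width)
    np.add.at(out, buckets, signs * x)  # unbuffered add handles collisions
    return out


def exact_influence(factors_a, factors_b):
    """Frobenius inner product of two outer-product gradients, computed
    from the factors: <r_a h_a^T, r_b h_b^T> = (r_a . r_b) * (h_a . h_b)."""
    (ra, ha), (rb, hb) = factors_a, factors_b
    return float(ra @ rb) * float(ha @ hb)


def sketched_influence(sketch_a, sketch_b):
    """Same product, estimated from the compressed factors."""
    (ra, ha), (rb, hb) = sketch_a, sketch_b
    return float(ra @ rb) * float(ha @ hb)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, hidden = 5000, 256           # toy sizes, chosen for illustration
    r_width, h_width = 512, 64          # hypothetical sketch widths
    W = rng.normal(size=(vocab, hidden)) / np.sqrt(hidden)

    r_hash = make_countsketch(vocab, r_width, seed=1)
    h_hash = make_countsketch(hidden, h_width, seed=2)

    def index_entry(h, target):
        r, h = readout_factors(W, h, target)
        return (countsketch(r, *r_hash, r_width),
                countsketch(h, *h_hash, h_width))

    h_train, h_test = rng.normal(size=hidden), rng.normal(size=hidden)
    exact = exact_influence(readout_factors(W, h_train, 3),
                            readout_factors(W, h_test, 3))
    approx = sketched_influence(index_entry(h_train, 3), index_entry(h_test, 3))
    print("exact:", exact, "sketched:", approx)
```

In this toy setup each indexed example stores r_width + h_width numbers rather than the vocab × hidden entries of the full readout gradient. The inner product of two CountSketches is an unbiased estimate of the original inner product, with variance that shrinks as the width grows, so the widths trade storage against accuracy.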

Yide Ran, Jianwen Xie, Minghui Wang, Wenjin Zheng, Denghui Zhang, Chuan Li, Zhaozhuo Xu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Question Answering | ARC Challenge | Accuracy (ARC): 26.62 | 598
Recall | Finance-Medical Dataset (test) | Top-5 auPRC: 97.54 | 37
Backdoor Attack Task Recall | WebQuestion howdy (test) | Top-5 auPRC: 1 | 30
Junk Data Detection | Brain Rot (test) | Top-5 auPRC: 86.14 | 30
Predict Future | Finance–Medical Dataset | Top-5 auPRC: 99.15 | 30
Junk Data Detection | Brain Rot Predict Future (test) | auPRC (Top 5): 86.7 | 30
Backdoor Attack Predict Future | Howdy! | Top-5 auPRC: 100 | 29
Data Attribution | Brain Rot Study Evaluation Suite | Brain Rot: 83.6 | 28
Backdoor Attack Task Predict Future | WebQuestion Howdy (Alpaca-howdy-52K distribution) (test) | Top-5 auPRC: 100 | 12
Backdoor Attack Task Recall | WebQuestion (test) | Top 5 auPRC: 1 | 12
Showing 10 of 28 rows
