
DataDignity: Training Data Attribution for Large Language Models

About

Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
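The abstract names two standard ingredients, Recall@10 as the metric and InfoNCE as the training loss. The sketch below illustrates both for a single query in plain Python. It is not the authors' implementation: the function names, the dot-product similarity, the temperature of 0.07, and the toy vectors are all assumptions made for illustration, and the feature extraction that ScoringModel uses is not shown here.

```python
import math


def recall_at_k(ranked_doc_ids, gold_doc_id, k=10):
    """Return 1.0 if the gold source document appears in the top-k ranking, else 0.0."""
    return 1.0 if gold_doc_id in ranked_doc_ids[:k] else 0.0


def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity function)."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(response_vec, positive_vec, negative_vecs, temperature=0.07):
    """InfoNCE loss for one response.

    The loss is the negative log-softmax score of the true source document
    (the positive) against a set of negatives. In the paper's setting the
    negatives would be in-batch, retrieval-mined, and anti-document examples;
    here they are just a list of vectors.
    """
    logits = [dot(response_vec, positive_vec) / temperature]
    logits += [dot(response_vec, neg) / temperature for neg in negative_vecs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical embeddings in a shared 2-D space.
response = [1.0, 0.0]
true_source = [0.9, 0.1]
anti_document = [0.2, 0.8]  # stands in for a hard negative

loss_correct = info_nce(response, true_source, [anti_document])
loss_swapped = info_nce(response, anti_document, [true_source])
```

With these toy vectors the loss is lower when the true source is treated as the positive than when the anti-document is, which is the behavior the contrastive objective rewards. Averaging `recall_at_k` over all queries gives the fraction of queries whose source document is retrieved in the top k.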

Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document Retrieval | Transformed Query Conditions excluding clean prompts (average) | Recall@10: 51.1 | 27 |
| Evidence Retrieval | FAKEWIKI Clean | Recall@10: 78.1 | 26 |
| Evidence Retrieval | FAKEWIKI Obfuscate | Recall@10: 59.5 | 26 |
| Evidence Retrieval | FAKEWIKI RolePlay | Recall@10: 63.8 | 26 |
| Evidence Retrieval | FAKEWIKI NoiseInjection | Recall@10: 63.5 | 26 |
| Evidence Retrieval | FAKEWIKI Indirect | Recall@10: 21.6 | 26 |
| Document Retrieval | 45 model-by-query-condition cells | Wins: 41 | 3 |
| Training Data Attribution | FAKEWIKI Clean | Recall@10: 77.2 | 3 |
| Training Data Attribution | FAKEWIKI Obfuscate | Recall@10: 44.4 | 3 |
| Training Data Attribution | FAKEWIKI RolePlay | Recall@10: 62.5 | 3 |
Showing 10 of 12 rows
