DataDignity: Training Data Attribution for Large Language Models

About

Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
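The abstract names two quantities without defining them here: the InfoNCE objective used to train ScoringModel, and the Recall@10 metric used throughout the results. The following is a minimal sketch of both under stated assumptions, not the authors' implementation. The function names, the cosine-similarity scoring, the temperature of 0.07, and the single-gold-document-per-query convention are all hypothetical choices made for illustration. Retrieval-mined and anti-document negatives are simply passed in together as per-response hard negatives.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(responses, positives, hard_negatives, temperature=0.07):
    """InfoNCE with in-batch negatives plus extra per-response hard negatives.

    responses:      list of B response embeddings
    positives:      list of B embeddings; positives[i] supports responses[i]
    hard_negatives: list of B lists of embeddings (assumed to hold e.g.
                    retrieval-mined or anti-document negatives)
    """
    responses = [_normalize(r) for r in responses]
    positives = [_normalize(p) for p in positives]
    losses = []
    for i, r in enumerate(responses):
        # Entry i is the positive; the other in-batch documents are negatives.
        logits = [_dot(r, p) / temperature for p in positives]
        logits += [_dot(r, _normalize(n)) / temperature for n in hard_negatives[i]]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


def recall_at_k(scores, gold, k=10):
    """Percentage of queries whose gold document ranks in the top k.

    scores: list of per-query score lists over the candidate corpus
    gold:   list of gold document indices, one per query (an assumption)
    """
    hits = 0
    for row, g in zip(scores, gold):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += g in top_k
    return 100.0 * hits / len(gold)
```

Under these assumptions, the loss is small when each response scores its own supporting document above every other candidate, and large when a negative outranks it. This is why topically similar anti-documents make useful negatives: they remove the option of ranking by topic alone.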

Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier • 2026

Related benchmarks

Task	Dataset	Result
Document Retrieval	Transformed Query Conditions excluding clean prompts (average)	Recall@10: 51.1	27
Evidence Retrieval	FAKEWIKI Clean	Recall@10: 78.1	26
Evidence Retrieval	FAKEWIKI Obfuscate	Recall@10: 59.5	26
Evidence Retrieval	FAKEWIKI RolePlay	Recall@10: 63.8	26
Evidence Retrieval	FAKEWIKI NoiseInjection	Recall@10: 63.5	26
Evidence Retrieval	FAKEWIKI Indirect	Recall@10: 21.6	26
Document Retrieval	45 model-by-query-condition cells	Wins: 41	3
Training Data Attribution	FAKEWIKI Clean	Recall@10: 77.2	3
Training Data Attribution	FAKEWIKI Obfuscate	Recall@10: 44.4	3
Training Data Attribution	FAKEWIKI RolePlay	Recall@10: 62.5	3

Showing 10 of 12 rows
