DataDignity: Training Data Attribution for Large Language Models
About
Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
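The contrastive objective mentioned above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `info_nce_loss`, the cosine-similarity scoring, the temperature value, and the array shapes are all assumptions made for illustration. It only shows how in-batch negatives and additional hard negatives (such as retrieval-mined documents or anti-documents) could enter a single InfoNCE loss.

```python
import numpy as np


def info_nce_loss(resp, docs, hard_negs=None, temperature=0.07):
    """Illustrative InfoNCE loss (hypothetical helper, not the paper's code).

    resp:      (B, d) response embeddings.
    docs:      (B, d) document embeddings; docs[i] is the positive for resp[i],
               and every other row acts as an in-batch negative.
    hard_negs: optional (B, M, d) extra negatives per response, e.g.
               retrieval-mined documents or anti-documents.
    temperature: assumed value; the abstract does not specify one.
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    resp, docs = l2norm(resp), l2norm(docs)

    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = resp @ docs.T / temperature

    if hard_negs is not None:
        # (B, M) similarities between each response and its own hard negatives.
        hard = np.einsum("bd,bmd->bm", resp, l2norm(hard_negs)) / temperature
        logits = np.concatenate([logits, hard], axis=1)

    # Numerically stable log-softmax over all candidates in each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy against the positive, which sits at column i for row i.
    idx = np.arange(resp.shape[0])
    return float(-log_probs[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    responses = rng.normal(size=(8, 16))
    # Positives are noisy copies of the responses; negatives are random.
    positives = responses + 0.1 * rng.normal(size=(8, 16))
    negatives = rng.normal(size=(8, 4, 16))
    print(info_nce_loss(responses, positives, negatives))
```

In this sketch, a hard negative that stays close to the positive (as an anti-document is designed to) raises the loss until the ranker learns to separate the two, which is the intuition behind mixing the three kinds of negatives.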
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document Retrieval | Transformed Query Conditions excluding clean prompts (average) | Recall@10: 51.1 | 27 |
| Evidence Retrieval | FAKEWIKI Clean | Recall@10: 78.1 | 26 |
| Evidence Retrieval | FAKEWIKI Obfuscate | Recall@10: 59.5 | 26 |
| Evidence Retrieval | FAKEWIKI RolePlay | Recall@10: 63.8 | 26 |
| Evidence Retrieval | FAKEWIKI NoiseInjection | Recall@10: 63.5 | 26 |
| Evidence Retrieval | FAKEWIKI Indirect | Recall@10: 21.6 | 26 |
| Document Retrieval | 45 model-by-query-condition cells | Wins: 41 | 3 |
| Training Data Attribution | FAKEWIKI Clean | Recall@10: 77.2 | 3 |
| Training Data Attribution | FAKEWIKI Obfuscate | Recall@10: 44.4 | 3 |
| Training Data Attribution | FAKEWIKI RolePlay | Recall@10: 62.5 | 3 |