Probing for Knowledge Attribution in Large Language Models

About

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations, where the model misuses provided context, and factuality violations, where answers reflect errors in internal knowledge. Proper mitigation depends on knowing which source drives each answer. We study contributive attribution, i.e. the classification of the dominant knowledge source behind each output, and show that a simple linear probe trained on hidden representations can reliably identify it. We introduce AttriWiki, a self-supervised pipeline that automatically generates labelled training data by prompting models to recall withheld entities from memory or read them from context, without relying on knowledge conflicts. Probes trained on AttriWiki achieve up to 0.96 Macro-$F_1$ on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transfer to SQuAD and WebQuestions with 0.94–0.99 Macro-$F_1$, and generalise zero-shot to the benchmark of Tighidet et al. (2024), outperforming their probe on conflicting settings without retraining. Furthermore, attribution mismatches raise error rates by up to 70%, though correct attribution does not guarantee correct answers, pointing to the need for broader detection frameworks.
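
To make the idea of a linear probe scored by Macro-$F_1$ concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the "hidden states" are synthetic Gaussian vectors, the labels 0 ("memory") and 1 ("context") are illustrative, and every function name (`make_synthetic_states`, `train_probe`, `predict`, `macro_f1`) is hypothetical. The probed layer, the features, and the training setup used in the paper are not given in this abstract.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, dim, rng):
    """ASSUMPTION: stand-in for real hidden states.
    Label 1 ("context") has mean +1 per dimension, label 0 ("memory") mean -1."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.0 if label else -1.0
        X.append([rng.gauss(shift, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


def train_probe(X, y, lr=0.1, epochs=50):
    """Fit a linear probe (logistic regression) with plain SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 ("context") if the logit is non-negative, else 0 ("memory")."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    X_train, y_train = make_synthetic_states(200, 16, rng)
    X_test, y_test = make_synthetic_states(100, 16, rng)
    w, b = train_probe(X_train, y_train)
    preds = [predict(w, b, x) for x in X_test]
    print(f"Macro-F1 on held-out synthetic states: {macro_f1(y_test, preds):.3f}")
```

In a real setting the synthetic vectors would be replaced by representations extracted from the model under study. Macro-$F_1$ weights both classes equally, so it is not inflated when one knowledge source dominates the data.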

Ivo Brink, Alexander Boer, Dennis Ulmer • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | SQuAD | Accuracy 99.6 | 32 |
| Question Answering | WebQuestions | WebQ Accuracy 99.6 | 14 |
| Fact Attribution | PARAREL never seen before | Macro-F1 80.6 | 6 |
| Question Answering | SQuAD (out-of-domain) | Accuracy 99.8 | 3 |
| Question Answering | WebQuestions (out-of-domain) | Accuracy 98 | 3 |