
Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

About

Hallucinations in large language models (LLMs) produce fluent continuations that are not supported by the prompt, especially under minimal contextual cues and ambiguity. We introduce Distributional Semantics Tracing (DST), a model-native method that builds layer-wise semantic maps at the answer position by decoding residual-stream states through the unembedding, selecting a compact top-$K$ concept set, and estimating directed concept-to-concept support via lightweight causal tracing. Using these traces, we test a representation-level hypothesis: hallucinations arise from correlation-driven representational drift across depth, where the residual stream is pulled toward a locally coherent but context-inconsistent concept neighborhood reinforced by training co-occurrences. On the Racing Thoughts dataset, DST yields more faithful explanations than attribution, probing, and intervention baselines under an LLM-judge protocol, and the resulting Contextual Alignment Score (CAS) strongly predicts failures, supporting this drift hypothesis.
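The first step of the method, decoding residual-stream states through the unembedding and keeping a top-$K$ concept set, can be sketched in the style of a logit lens. Everything below is an illustrative assumption (toy dimensions, random stand-in weights, the helper name `top_k_concepts`), not the authors' implementation:

```python
import numpy as np

# Sketch of the layer-wise "semantic map" step: project a residual-stream
# state at the answer position through the unembedding matrix and keep the
# top-K tokens as that layer's concept set. Shapes and names are illustrative.

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000                  # toy model dimensions (assumption)
W_U = rng.normal(size=(vocab, d_model))    # stand-in unembedding matrix

def top_k_concepts(h, W_U, k=5):
    """Decode residual state h (d_model,) to vocab logits and return the
    top-k token ids with their softmax probabilities, highest first."""
    logits = W_U @ h
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    idx = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in idx]

h = rng.normal(size=d_model)               # residual state at the answer position
concepts = top_k_concepts(h, W_U, k=5)
```

Repeating this per layer yields the depth-wise trajectory of concept neighborhoods that DST then connects via causal tracing; the drift hypothesis concerns how this trajectory moves between such neighborhoods across depth.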

Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Faithfulness Evaluation | Halogen | CODE | 73 | 20 |
| Faithfulness Evaluation | Racing Thoughts | Faithfulness (SmolLM2 135M) | 0.72 | 10 |
