The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

About

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.

Samuel Marks, Max Tegmark• 2023

Related benchmarks

Task	Dataset	Result
Hallucination Detection	TriviaQA	AUROC0.6291	621
Hallucination Detection	TriviaQA (test)	AUC-ROC62.91	243
Hallucination Detection	CoQA	Mean AUROC0.6161	107
Hallucination Detection	RAGTruth (test)	AUROC0.6191	99
Hallucination Detection	MATH	Mean AUROC59.58	72
Claim Verification	9-dataset aggregate retrieval-free setting (test)	ROC-AUC73.3	70
Harmfulness Detection	BeaverTails bottom 30% uncertainty slice (test)	AUROC76.6	70
Harmfulness Detection	BeaverTails, ToxicChat, PKU-SafeRLHF Hard (test)	AUROC87.8	63
Harmfulness Detection	Full in-distribution (test)	AUROC0.93	63
Hallucination Detection	CommonsenseQA	Mean AUROC0.5468	62

Showing 10 of 39 rows

Other info

Follow for update

@wizwand_team Discord