Latent Veracity Inference for Identifying Errors in Stepwise Reasoning

About

Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
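
To make the idea of a discrete search over veracity assignments concrete, here is a minimal sketch. It is a generic greedy bit-flip search, not the authors' Veracity Search algorithm, whose details are not given in the abstract. The names `greedy_veracity_search` and `toy_log_reward` are hypothetical, and `toy_log_reward` is a stand-in for the LM's joint likelihood over veracity and the final answer, which would need an actual model to compute.

```python
from typing import Callable, List, Tuple


def greedy_veracity_search(
    num_steps: int,
    log_reward: Callable[[List[int]], float],
    max_passes: int = 10,
) -> Tuple[List[int], float]:
    """Greedy bit-flip search over binary veracity assignments.

    Illustrative assumption only: starts from "every step is correct" and
    flips one step's veracity at a time, keeping a flip only if it strictly
    increases the proxy reward.
    """
    current = [1] * num_steps  # 1 = step judged correct, 0 = incorrect
    current_score = log_reward(current)
    for _ in range(max_passes):
        improved = False
        for i in range(num_steps):
            candidate = current.copy()
            candidate[i] = 1 - candidate[i]  # flip the veracity of step i
            score = log_reward(candidate)
            if score > current_score:
                current, current_score = candidate, score
                improved = True
        if not improved:  # local optimum: no single flip helps
            break
    return current, current_score


def toy_log_reward(veracity: List[int]) -> float:
    """Hypothetical stand-in for an LM-based proxy reward.

    Scores an assignment by how closely it matches a fixed assignment in
    which the third step is the erroneous one.
    """
    target = [1, 1, 0, 1]
    return float(-sum(a != b for a, b in zip(veracity, target)))


best, best_score = greedy_veracity_search(4, toy_log_reward)
print(best, best_score)
```

With a real LM, the reward would be far more expensive to evaluate than this toy function, which is why the number of candidate assignments scored per chain matters.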

Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio • 2025

Related benchmarks

Task	Dataset	Result
Veracity Inference	PRONTOQA (1,000 examples)	Mean Hamming Similarity 96.4	20
Veracity Inference	GSM8K (1,000 examples)	Mean Hamming Similarity 75.1	20
Veracity Inference	COMMONSENSEQA (1,000 examples)	Mean Hamming Similarity 0.935	20
Reasoning accuracy	PRONTOQA 5-hop	Accuracy 81	14
Reasoning accuracy	PRONTOQA 3-hop	Accuracy 87	6
Reasoning accuracy	PRONTOQA 4-hop	Accuracy 85	6
Veracity Inference	PRONTOQA 3-hop (test)	Hamming Similarity 95.6	4
Veracity Inference	PRONTOQA 4-hop (test)	Hamming Similarity 96.7	4
Veracity Inference	PRONTOQA 5-hop (test)	Hamming Similarity 0.955	4
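
For readers unfamiliar with the metric, the sketch below shows one way to compute it, assuming Hamming similarity means the fraction of reasoning steps whose predicted veracity matches the reference label, and that the mean is taken over chains. This definition and the function names are assumptions rather than something stated on this page. Note also that the results above appear to mix a 0 to 100 scale with a 0 to 1 scale.

```python
from typing import List, Sequence


def hamming_similarity(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Fraction of positions where predicted and reference veracity agree."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    if not pred:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(pred)


def mean_hamming_similarity(
    preds: List[Sequence[int]], refs: List[Sequence[int]]
) -> float:
    """Average per-chain Hamming similarity over a set of reasoning chains."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must contain the same number of chains")
    scores = [hamming_similarity(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)


# One chain with a single mislabeled step out of four, one chain fully correct.
single = hamming_similarity([1, 0, 1, 1], [1, 1, 1, 1])
mean = mean_hamming_similarity([[1, 0, 1, 1], [1, 1]], [[1, 1, 1, 1], [1, 1]])
print(single, mean)
```

Multiplying the result by 100 gives the percentage form.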

