
Latent Veracity Inference for Identifying Errors in Stepwise Reasoning

About

Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
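
To make the idea of a discrete search over veracity assignments concrete, here is a minimal sketch. It is not the authors' Veracity Search: it is a generic greedy bit-flip local search, and `greedy_flip_search`, `toy_reward`, and the hidden target assignment are hypothetical names and values chosen for illustration. In the abstract's setting, the reward would be the LM's joint likelihood over veracity and the final answer; here a toy function stands in for it.

```python
from typing import Callable, Tuple

Assignment = Tuple[int, ...]  # one 0/1 veracity label per reasoning step


def greedy_flip_search(
    num_steps: int,
    reward: Callable[[Assignment], float],
    max_rounds: int = 10,
) -> Tuple[Assignment, float]:
    """Greedy local search: flip one label at a time, keep flips that raise the reward."""
    current: Assignment = tuple([1] * num_steps)  # start by assuming every step is correct
    best = reward(current)
    for _ in range(max_rounds):
        improved = False
        for i in range(num_steps):
            candidate = current[:i] + (1 - current[i],) + current[i + 1:]
            score = reward(candidate)
            if score > best:
                current, best = candidate, score
                improved = True
        if not improved:  # no single flip helps any more
            break
    return current, best


def toy_reward(v: Assignment) -> float:
    """Hypothetical stand-in for an LM-based proxy reward (assumption, not from the paper)."""
    target = (1, 0, 1, 1)  # pretend step 2 of a 4-step chain is wrong
    return -float(sum(a != b for a, b in zip(v, target)))


result = greedy_flip_search(4, toy_reward)
print(result)
```

With the toy reward, the search flips only the second label, marking that step as incorrect. A real proxy reward would require LM calls for each candidate, which is why limiting the number of evaluated assignments matters.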

Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio • 2025

Related benchmarks

Task                Dataset                         Metric                   Result  Rank
Veracity Inference  PRONTOQA (1,000 examples)       Mean Hamming Similarity  96.4    20
Veracity Inference  GSM8K (1,000 examples)          Mean Hamming Similarity  75.1    20
Veracity Inference  COMMONSENSEQA (1,000 examples)  Mean Hamming Similarity  0.935   20
Reasoning accuracy  PRONTOQA 3-hop                  Accuracy                 87      6
Reasoning accuracy  PRONTOQA 4-hop                  Accuracy                 85      6
Reasoning accuracy  PRONTOQA 5-hop                  Accuracy                 81      6
Veracity Inference  PRONTOQA 3-hop (test)           Hamming Similarity       95.6    4
Veracity Inference  PRONTOQA 4-hop (test)           Hamming Similarity       96.7    4
Veracity Inference  PRONTOQA 5-hop (test)           Hamming Similarity       0.955   4
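
The Hamming similarity metric reported for veracity inference can be illustrated with a short sketch. It assumes that Hamming similarity means the fraction of reasoning steps whose predicted veracity label matches the reference label, and that the mean is taken over examples; the function names and the sample labels are hypothetical, not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def hamming_similarity(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the reference label."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    if len(ref) == 0:
        return 1.0  # convention chosen here: empty chains match trivially
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(ref)


def mean_hamming_similarity(
    pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average the per-example Hamming similarity over (pred, ref) pairs."""
    scores = [hamming_similarity(p, r) for p, r in pairs]
    if not scores:
        raise ValueError("need at least one example")
    return sum(scores) / len(scores)


# Made-up labels: 1 = step judged correct, 0 = step judged incorrect.
single = hamming_similarity([1, 1, 0, 1], [1, 0, 0, 1])
mean = mean_hamming_similarity([
    ([1, 1, 0, 1], [1, 0, 0, 1]),  # 3 of 4 labels match
    ([1, 0], [1, 0]),              # 2 of 2 labels match
])
print(single, mean)
```

Under this definition the score lies in [0, 1]; multiplying by 100 gives the percentage form.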
