Universal Self-Consistency for Large Language Model Generation

About

Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou• 2023

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy73.9	1398
Mathematical Reasoning	AIME 2024	Accuracy84.2	479
Reasoning	MMLU-Pro	Accuracy78.01	241
Reasoning	GPQA Diamond	Accuracy63.64	185
Question Answering	NQ (test)	--	133
Instruction Following	IFEval	Accuracy (IFEval)90.39	89
Truthfulness	TruthfulQA	Truthfulness Accuracy77.11	86
Mathematical Problem Solving	MATH500	Accuracy90.4	83
Reward Modeling	RewardBench Focus 2	Accuracy61.2	82
Reward Modeling	RewardBench Precise IF 2	--	70

Showing 10 of 45 rows

Other info

Follow for update

@wizwand_team Discord