Universal Self-Consistency for Large Language Model Generation

About

Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.
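The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`majority_vote`, `build_usc_prompt`, `parse_selection`, `universal_self_consistency`), and the `llm` callable are all assumptions introduced here. `majority_vote` shows standard self-consistency, which needs extracted answers it can compare exactly; the USC path instead asks the model to choose among the free-form responses.

```python
import re
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Standard self-consistency: return the most frequent extracted answer."""
    return Counter(answers).most_common(1)[0][0]


def build_usc_prompt(question: str, responses: List[str]) -> str:
    """Concatenate all sampled responses into one selection prompt.

    The wording is a hypothetical prompt, not quoted from the paper.
    """
    numbered = "\n\n".join(
        f"Response {i}: {r}" for i, r in enumerate(responses, start=1)
    )
    return (
        f"Question: {question}\n\n"
        f"{numbered}\n\n"
        "Select the response that is most consistent with the others. "
        'Answer in the form "Response <number>".'
    )


def parse_selection(reply: str, n: int) -> int:
    """Return the zero-based index named in the reply.

    Falls back to index 0 (an assumed default) when the reply names no
    response or names one outside the range 1..n.
    """
    match = re.search(r"Response\s+(\d+)", reply)
    if match:
        idx = int(match.group(1)) - 1
        if 0 <= idx < n:
            return idx
    return 0


def universal_self_consistency(
    question: str,
    responses: List[str],
    llm: Callable[[str], str],
) -> str:
    """Ask the model itself to pick the most consistent sampled response."""
    reply = llm(build_usc_prompt(question, responses))
    return responses[parse_selection(reply, len(responses))]
```

Here `llm` stands for any function that maps a prompt string to a completion string, so the sketch can be exercised with a stub before wiring in a real model. Because selection never parses the candidate answers, the same function applies to summaries or code where exact-match voting is unavailable.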

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou • 2023

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | GSM8K | Accuracy 73.9 | 1398
Mathematical Reasoning | AIME 2024 | Accuracy 84.2 | 479
Reasoning | MMLU-Pro | Accuracy 78.01 | 241
Reasoning | GPQA Diamond | Accuracy 63.64 | 185
Question Answering | NQ (test) | -- | 133
Instruction Following | IFEval | Accuracy (IFEval) 90.39 | 89
Truthfulness | TruthfulQA | Truthfulness Accuracy 77.11 | 86
Mathematical Problem Solving | MATH500 | Accuracy 90.4 | 83
Reward Modeling | RewardBench Focus 2 | Accuracy 61.2 | 82
Reward Modeling | RewardBench Precise IF 2 | -- | 70
Showing 10 of 45 rows
