
Universal Self-Consistency for Large Language Model Generation

About

Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.
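The core mechanism described above — asking the LLM itself to pick the most consistent answer among sampled candidates — can be sketched as a simple prompt-construction and selection step. This is an illustrative sketch, not code from the paper: the function names (`build_usc_prompt`, `pick_candidate`) and the exact prompt wording are assumptions, and the `mock_llm` stands in for a real model call.

```python
import re

def build_usc_prompt(question, candidates):
    """Build a USC-style prompt listing all sampled responses and asking
    the model to choose the most consistent one (wording is illustrative)."""
    lines = [f"Question: {question}", "", "Candidate responses:"]
    for i, cand in enumerate(candidates, 1):
        lines.append(f"Response {i}: {cand}")
    lines += ["", "Evaluate these responses and select the most consistent "
                  "one based on majority consensus. "
                  "Answer with the response number only."]
    return "\n".join(lines)

def pick_candidate(model_reply, candidates):
    """Parse the model's reply (e.g. 'Response 2') and return that candidate;
    fall back to the first candidate if no number is found."""
    m = re.search(r"\d+", model_reply)
    return candidates[int(m.group()) - 1] if m else candidates[0]

# Hypothetical stand-in for an actual LLM call.
def mock_llm(prompt):
    return "Response 1"

candidates = ["Paris", "Lyon", "Paris"]
prompt = build_usc_prompt("What is the capital of France?", candidates)
answer = pick_candidate(mock_llm(prompt), candidates)
print(answer)  # → Paris
```

Unlike standard self-consistency, no answer-extraction or exact-match voting is needed, which is why this works for free-form outputs such as summaries or code.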

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K | Accuracy 73.9 | 983 |
| Reward Modeling | RewardBench Focus 2 | Accuracy 61.2 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | -- | 70 |
| Question Answering | NQ (test) | -- | 66 |
| Reward Modeling Evaluation | RewardBench Factuality 2 | Pairwise Accuracy 47.5 | 64 |
| Long-form Question Answering with Citations | ASQA | EM 42.75 | 37 |
| Question Answering | TriviaQA | -- | 32 |
| Question Answering | TruthfulQA | Info Accuracy 99.2 | 27 |
| Question Answering | NQ-Open | Exact Match (EM) 38.6 | 24 |
| Workflow Extraction | SynthABCD | Macro Score 84.31 | 24 |
Showing 10 of 24 rows
