
Soft Self-Consistency Improves Language Model Agents

About

Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (SOFT-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. SOFT-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, SOFT-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that SOFT-SC can be applied to both open-source and black-box models.
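The contrast between voting and likelihood-based scoring can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that SOFT-SC uses a continuous score computed from model likelihoods, so the mean token probability used here is an assumption, and the function names and sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(actions):
    """SC-style selection: return the most frequent action.

    When every sampled action is distinct, all counts are 1 and the
    vote carries no signal (the first-seen action is returned).
    """
    return Counter(actions).most_common(1)[0][0]


def soft_select(candidates):
    """Soft selection sketch: return the action with the highest score.

    `candidates` is a list of (action, token_probs) pairs. The score is
    the mean token probability, which is an assumed aggregation.
    """
    return max(candidates, key=lambda c: mean(c[1]))[0]


# Hypothetical samples: three distinct bash actions with made-up
# per-token probabilities from the generating model.
samples = [
    ("ls -a", [0.9, 0.4]),
    ("find . -name '*.txt'", [0.8, 0.9, 0.85]),
    ("grep -r txt .", [0.5, 0.6]),
]

voted = majority_vote([action for action, _ in samples])
soft = soft_select(samples)
```

Because the three sampled actions are all different, the vote is a tie and falls back to the first sample, while the likelihood-based score still separates the candidates.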

Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reward Modeling | RewardBench Focus 2 | Accuracy: 66.3 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | -- | 70 |
| Reward Modeling Evaluation | Reward Bench Factuality 2 | Pairwise Accuracy: 41 | 64 |
| Translation Preference Prediction | WMT en-de | Pairwise Acc: 49.6 | 12 |
| Reward Modeling Evaluation | Reward Bench Math 2 | Pairwise Accuracy: 65.4 | 12 |
| Reward Modeling Evaluation | Reward Bench Safety 2 | Pairwise Accuracy: 63.5 | 12 |
| Machine Translation Evaluation | WMT 2023 (test) | MAE (EN→DE): 0.664 | 12 |
| Reward Modeling Evaluation | Reward Bench Ties 2 | Pairwise Accuracy: 82.3 | 12 |
| Translation Preference Prediction | WMT Zh-En | Pairwise Accuracy: 52.9 | 12 |
| Reward Model Evaluation | Reward Bench 2 (test) | RB2 Factuality MAE: 0.681 | 12 |
