VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

About

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self-Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting requires calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This second series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter out reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. We evaluate VecCISC on five challenging, widely adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces total token usage by 47% while maintaining or exceeding the accuracy of CISC.
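The three ideas in the abstract (majority voting, confidence-weighted voting, and filtering near-duplicate traces before the critic is called) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the clustering procedure, so the greedy cosine-similarity filter, the `threshold` value, and all function names and toy embeddings below are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
import math


def self_consistency(answers):
    """Plain Self-Consistency: return the most frequent candidate answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, confidences):
    """Weighted majority vote: sum the confidence per answer, pick the largest total."""
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_near_duplicates(embeddings, threshold=0.95):
    """Hypothetical greedy filter (an assumption, not the paper's method):
    keep a trace only if its embedding is below `threshold` similarity
    to every trace already kept. Returns the indices of kept traces."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy data: five sampled answers, critic confidences, and 2-D trace embeddings.
answers = ["A", "A", "B", "B", "B"]
confidences = [0.9, 0.8, 0.3, 0.2, 0.4]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.0], [0.0, 0.98]]

majority = self_consistency(answers)            # "B": three votes against two
weighted = weighted_vote(answers, confidences)  # "A": total 1.7 against 0.9
kept = filter_near_duplicates(embeddings)       # [0, 2]: two critic calls, not five
```

In this toy case the two voting rules disagree, and the filter leaves only two traces for the critic to score. Whether accuracy is preserved after such filtering depends on the actual embedding model and selection rule, which the sketch does not attempt to reproduce.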

James Petullo, Sonny George, Dylan Cashman, Nianwen Xue • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AQUA-RAT | Accuracy | 87.7 | 153 |
| Multi-task Language Understanding | MMLU-Pro | Best Accuracy | 71.4 | 25 |
| Expert-Level Question Answering | GPQA | Best Accuracy | 61.7 | 25 |
| Science Question Answering | ARC Challenging | Best Accuracy | 96.3 | 25 |
| Commonsense Reasoning | CommonsenseQA | LLM critic Calls | 15.54 | 10 |
| Expert-Level Question Answering | GPQA | LLM critic Calls | 17.27 | 5 |
| Expert-level Science Reasoning | GPQA | LLM critic Calls | 18.81 | 5 |
| Massive Multitask Language Understanding | MMLU-Pro | LLM critic Calls | 17.47 | 5 |
| Question Answering | ARC Challenging | LLM critic Calls | 15.65 | 5 |
| Science Question Answering | ARC Challenging | LLM critic Calls | 15.65 | 5 |