
Reliable Chain-of-Thought via Prefix Consistency

About

Large language models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, which weights each candidate answer by how often it reappears under regeneration. It requires neither access to token log-probabilities nor self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches the plateau accuracy of standard MV with up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
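The reweighting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `prefix_consistency` and `weighted_vote`, the exact-match comparison of answers, and the choice to use the raw reproduction rate as the vote weight are all assumptions made for illustration. The abstract does not specify the truncation point, the number of regenerations, or the weighting formula, so the regenerated answers are simply passed in as data here.

```python
from collections import Counter, defaultdict
from typing import Sequence


def prefix_consistency(original_answer: str, regenerated: Sequence[str]) -> float:
    """Fraction of regenerations (from a truncated prefix) that reproduce
    the trace's original answer. Hypothetical scoring rule."""
    if not regenerated:
        return 0.0
    return sum(a == original_answer for a in regenerated) / len(regenerated)


def weighted_vote(answers: Sequence[str], weights: Sequence[float]) -> str:
    """Return the answer with the largest total weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


# Toy data (made up): three sampled traces and, for each, the answers
# obtained by regenerating the remainder of the truncated trace four times.
answers = ["42", "17", "17"]
regenerations = [
    ["42", "42", "42", "42"],  # stable under regeneration
    ["17", "9", "3", "17"],    # reproduced half the time
    ["5", "17", "8", "2"],     # reproduced a quarter of the time
]

scores = [prefix_consistency(a, r) for a, r in zip(answers, regenerations)]
majority = Counter(answers).most_common(1)[0][0]
reweighted = weighted_vote(answers, scores)
print(scores, majority, reweighted)
```

In this toy case plain majority voting picks "17" (two votes to one), while the reweighted vote picks "42", because the single "42" trace reproduces its answer every time and the two "17" traces together carry less weight.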

Naoto Iwase, Yuki Ichihara, Mohammad Atif Quamar, Junpei Komiyama • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2025 | Accuracy: 96.7 | 311 |
| Mathematical Reasoning | FrontierScience-Olympiad | Accuracy: 50.8 | 63 |
| Mathematical Reasoning | HMMT Feb 2026 | Accuracy: 80.4 | 40 |
| Scientific problem solving | FrontierScience-Olympiad | Token Efficiency Ratio (B_method/B_MV): 3.88 | 27 |
| Mathematical Reasoning | BRUMO 2025 | Pass@1: 96.4 | 8 |
| Mathematical Reasoning | AIME 2025 | Token Efficiency Ratio: 0.05 | 2 |
