
Learning to Reason Across Parallel Samples for LLM Reasoning

About

Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such sets of multiple samples. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on five reasoning datasets demonstrate both the efficacy and efficiency of SSA. Notably, SSA improves over naive majority voting by 8% pass@5 on MATH. Furthermore, our 3B SSA surpasses model-based re-ranking with a much larger 72B process reward model. Our analysis also shows promising generalization ability of SSA across sample set sizes, base model families and scales, and tasks. By separating the LLM that generates answers from the LLM that analyzes and aggregates the sampled answers, our approach can easily and efficiently work with outputs from premier black-box models.
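The two aggregation strategies the abstract contrasts can be sketched as follows: the majority-voting baseline picks the most frequent final answer among parallel samples, while the SSA approach concatenates all samples into one sequence for a separate model to read. This is a minimal illustration, not the paper's implementation; the prompt template in `build_ssa_input` is an assumption, and the paper's actual formatting may differ.

```python
from collections import Counter

def majority_vote(answers):
    """Baseline aggregation: return the most frequent final answer
    among the parallel samples (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]

def build_ssa_input(question, samples):
    """Concatenate the question and all sampled solutions into a single
    sequence, as SSA consumes. The template below is hypothetical."""
    parts = [f"Question: {question}"]
    for i, sample in enumerate(samples, 1):
        parts.append(f"Sample {i}: {sample}")
    parts.append("Final answer:")
    return "\n".join(parts)

# Example: five parallel samples with a 3-of-5 majority.
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> 42
```

Because SSA only reads the sampled text, the generator can be any model, including a black-box API, which is the decoupling the last sentence of the abstract refers to.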

Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi• 2025

Related benchmarks

Task | Dataset | Result | Rank
Web Browsing | Browsecomp | Accuracy: 70.67 | 52
Logical reasoning | HLE | Accuracy: 0.5419 | 46
Medical Reasoning | HealthBench Hard | Accuracy: 27.3 | 41
BrowseComp-Plus | BrowseComp+ | Accuracy: 73.33 | 25
HLE | HLE | Accuracy: 50.97 | 25
Long-horizon agentic task | BrowseComp+ | Performance: 76.67 | 24
Long-horizon agentic task | Browsecomp | Performance: 70.67 | 24
Long-horizon agentic task | HLE | Performance: 54.19 | 24
DeepSearchQA | DeepSearchQA | Accuracy: 62.67 | 19
Medical Question Answering | HealthBench Hard | Accuracy: 21.84 | 19
(Showing 10 of 19 rows.)
