Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

About

Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
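The core step described above, averaging the next-token logits of K parallel traces before a single token is chosen, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`average_logits`, `top_p_filter`, `merged_next_token`) are hypothetical, the logits are toy values rather than model outputs, the abstract does not specify where synchronization points fall, and the sketch assumes a greedy pick after optional Top-p filtering for a single decoding step.

```python
import math
from typing import List, Sequence


def average_logits(per_trace_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of K logit vectors defined over the same vocabulary."""
    k = len(per_trace_logits)
    if k == 0:
        raise ValueError("need at least one trace")
    vocab_size = len(per_trace_logits[0])
    if any(len(row) != vocab_size for row in per_trace_logits):
        raise ValueError("all traces must share one vocabulary size")
    return [sum(row[i] for row in per_trace_logits) / k for i in range(vocab_size)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtracts the max logit before exponentiating)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: Sequence[float], top_p: float) -> List[float]:
    """Keep the smallest set of tokens whose cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def merged_next_token(per_trace_logits: Sequence[Sequence[float]], top_p: float = 1.0) -> int:
    """Average logits across traces, apply Top-p, and return the most likely token id.

    Assumption: greedy selection; a real decoder would typically sample instead.
    """
    probs = top_p_filter(softmax(average_logits(per_trace_logits)), top_p)
    return max(range(len(probs)), key=lambda i: probs[i])


# Toy example: K = 3 traces over a 4-token vocabulary (values are made up).
trace_logits = [
    [2.0, 1.0, 0.0, -1.0],  # this trace alone would prefer token 0
    [0.0, 3.0, 0.0, -1.0],
    [1.0, 2.0, 0.5, -1.0],
]
shared_token = merged_next_token(trace_logits, top_p=0.9)
print(shared_token)
```

In this toy run the first trace on its own would pick token 0, but the averaged logits favor token 1, so all traces would continue from the same token. Because the merge acts on logits before filtering, it leaves room for the standard Top-p/Top-k step the abstract mentions.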

Haonan Wang, Chao Du, Kenji Kawaguchi, Tianyu Pang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	AIME 2025	Accuracy	78.7	227
Code Generation	LiveCodeBench	Pass@1	72.04	89
Closed-ended reasoning	GPQA Diamond (test)	Accuracy	64.1	63
Web Research	BrowseComp-EN 200	Pass@1	14.5	19
Web Research	BrowseComp-ZH	Pass@1	28.37	19
Web Research	xbench DeepSearch	Pass@1	57.6	18
Deep Research	GAIA	Pass@1	51.46	16

