ReThinker: Scientific Reasoning by Rethinking with Guided Reflection and Confidence Control

About

Expert-level scientific reasoning remains challenging for large language models, particularly on benchmarks such as Humanity's Last Exam (HLE), where rigid tool pipelines, brittle multi-agent coordination, and inefficient test-time scaling often limit performance. We introduce ReThinker, a confidence-aware agentic framework that orchestrates retrieval, tool use, and multi-agent reasoning through a stage-wise Solver-Critic-Selector architecture. Rather than following a fixed pipeline, ReThinker dynamically allocates computation based on model confidence, enabling adaptive tool invocation, guided multi-dimensional reflection, and robust confidence-weighted selection. To support scalable training without human annotation, we further propose a reverse data synthesis pipeline and an adaptive trajectory recycling strategy that transform successful reasoning traces into high-quality supervision. Experiments on HLE, GAIA, and XBench show that ReThinker consistently outperforms state-of-the-art foundation models with tools and existing deep research systems on expert-level reasoning tasks.

Zhentao Tang, Yuqi Cui, Shixiong Kai, Wenqian Zhao, Ke Ye, Xing Li, Anxin Tian, Zehua Pei, Hui-Ling Zhen, Shoubo Hu, Xiaoguang Li, Yunhe Wang, Mingxuan Yuan • 2026
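
The abstract mentions confidence-based allocation of computation and confidence-weighted selection but gives no implementation details. The following is only a minimal sketch of how those two ideas could look, not the authors' method. The `Candidate` type, the `needs_reflection` gate, its `threshold` value, and the summed-confidence voting rule are all assumptions introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str        # final answer proposed by one solver trajectory
    confidence: float  # assumed self-reported confidence in [0, 1]


def needs_reflection(candidate: Candidate, threshold: float = 0.8) -> bool:
    """Hypothetical gate: spend extra critic passes only on low-confidence answers."""
    return candidate.confidence < threshold


def select_answer(candidates: list[Candidate]) -> str:
    """Hypothetical selector: sum confidence per distinct answer, return the top one."""
    if not candidates:
        raise ValueError("no candidates to select from")
    scores: dict[str, float] = defaultdict(float)
    for c in candidates:
        # Normalize so trivially different spellings of an answer pool their weight.
        scores[c.answer.strip().lower()] += c.confidence
    return max(scores, key=scores.get)


# Illustrative inputs only; these are not results from the paper.
candidates = [
    Candidate("42", 0.9),
    Candidate("41", 0.4),
    Candidate("41", 0.3),
]
to_reflect = [c for c in candidates if needs_reflection(c)]
best = select_answer(candidates)
```

In this toy input a plain majority vote would choose "41" (two votes to one), while the confidence-weighted rule chooses "42" because its single vote carries more weight (0.9 against 0.7). Only the two low-confidence candidates would be sent back for further reflection.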

Related benchmarks

Task	Dataset	Result
Expert-Level Reasoning	HLE (Humanity's Last Exam) text-only subset (val)	Inference Accuracy: 52.2	13
Expert-Level Reasoning	GAIA text-only (val)	Inference Accuracy: 81.6	12
Expert-Level Reasoning	XBench-DeepSearch 1.0 (test)	Inference Accuracy: 0.9	12
