Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
About
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR moves the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors the memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups of up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
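The fork-and-join control flow behind genuine parallel execution can be sketched as below. This is a minimal illustration only: the function names, the thread-pool backend, and the string-join aggregation are all assumptions for the sketch, not the actual NPR Engine or PAPO implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_branch(branch_id: int, subproblem: str) -> str:
    # Placeholder for a model call that reasons over one decomposed subproblem.
    return f"branch-{branch_id}: partial answer for '{subproblem}'"

def parallel_reason(problem: str, subproblems: list[str]) -> str:
    # Fork: each decomposed branch executes concurrently, mirroring the
    # branching step of a parallel-reasoning execution graph.
    with ThreadPoolExecutor(max_workers=len(subproblems)) as pool:
        partials = list(pool.map(solve_branch,
                                 range(len(subproblems)), subproblems))
    # Join: aggregate branch outputs into a final answer once all
    # branches complete (the topological merge point of the graph).
    return " | ".join(partials)

print(parallel_reason("toy problem", ["case A", "case B", "case C"]))
```

In contrast, a model that only emulates parallelism would generate these branches one after another in a single autoregressive stream; the fork/join structure above is what allows the decode-time speedups reported below.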
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AMC23 | Avg@8 | 93.1 | 25 |
| Mathematical Reasoning | HMMT25 | Avg@8 Score | 32.9 | 20 |
| Mathematical Reasoning | Minerva Math | Avg@1 Accuracy | 47.1 | 18 |
| Mathematical Reasoning | OlympiadBench | Pass@1 | 63.7 | 12 |
| Logic Reasoning | ZebraLogic | Avg Accuracy@1 | 0.817 | 11 |
| Mathematical Reasoning | AIME 25 | Avg@8 Score | 53.8 | 11 |
| Mathematical Reasoning | AIME 24 | Avg@8 | 63.3 | 11 |
| Mathematical Reasoning | MATH500 | Average Score (avg@1) | 93.6 | 11 |
| Reasoning | HMMT25 | -- | -- | 4 |
| Reasoning | AIME25 | Throughput (TPS) | 2.98e+3 | 3 |