
PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning

About

We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.
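The "average accuracies" quoted above are typically computed by sampling several answers per problem and averaging correctness. The exact protocol (number of samples, decoding settings) is not stated here, so the following is only a minimal sketch under that assumption. The function name `average_pass_at_1` and the toy data are hypothetical, not taken from the paper.

```python
from typing import Sequence


def average_pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Average Pass@1 over a benchmark.

    results[i][j] is True if the j-th sampled answer to problem i was correct.
    The score is the mean, over problems, of the fraction of correct samples.
    """
    if not results:
        raise ValueError("at least one problem is required")
    per_problem = []
    for samples in results:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # sum() counts the True entries, so this is the per-problem accuracy.
        per_problem.append(sum(samples) / len(samples))
    return sum(per_problem) / len(per_problem)


# Hypothetical toy data: 3 problems, 4 sampled answers each.
toy_results = [
    [True, True, False, True],     # 3 of 4 correct
    [False, False, False, False],  # 0 of 4 correct
    [True, True, True, True],      # 4 of 4 correct
]
score = average_pass_at_1(toy_results)
print(f"average Pass@1 = {score:.3f}")
```

Averaging per problem first keeps every problem equally weighted even if the number of samples differs between problems.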

Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, Yonghong Tian • 2026

Related benchmarks

Task                    Dataset           Result                Rank
Mathematical Reasoning  AIME 2024 (test)  --                    159
Mathematical Reasoning  MATH 500          Pass@1 Rate: 68.7     76
Mathematical Reasoning  AIME 2025 (test)  Pass@1 Rate: 85.6     63
Mathematical Reasoning  AMC               Pass@1 Accuracy: 42   61
Mathematical Reasoning  AIME              Pass@1: 11.8          44
Mathematical Reasoning  MATH-P            Pass@1: 43.9          24
