Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization
About
Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational cost and latency without proportional performance gains. In this paper, we propose **F**ine-grained **G**roup policy **O**ptimization (**FGO**), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning weights based on response length and entropy, thereby enabling effective CoT compression. As an enhanced variant of Group Relative Policy Optimization (GRPO), FGO also addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance while resolving these key limitations of GRPO.
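To make the core idea concrete, below is a minimal sketch of a GRPO-style group advantage combined with length- and entropy-based weights in the spirit of FGO. The abstract does not specify the exact weighting scheme, so `fgo_weights`, the coefficients `alpha`/`beta`, and the multiplicative combination are illustrative assumptions, not the authors' formula.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO advantage: normalize rewards within the sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def fgo_weights(lengths: np.ndarray, entropies: np.ndarray,
                alpha: float = 0.5, beta: float = 0.5) -> np.ndarray:
    """Hypothetical fine-grained weights: favor shorter responses
    (for CoT compression) and higher-entropy responses (to counter
    entropy collapse). alpha and beta are assumed mixing coefficients."""
    # Down-weight responses that are longer than the group average.
    length_w = np.exp(-alpha * (lengths - lengths.mean()) / (lengths.std() + 1e-8))
    # Up-weight responses whose mean token entropy is above the group average.
    entropy_w = np.exp(beta * (entropies - entropies.mean()) / (entropies.std() + 1e-8))
    return length_w * entropy_w

# Toy group of 4 sampled responses for a single prompt.
rewards = np.array([1.0, 1.0, 0.0, 1.0])       # task correctness rewards
lengths = np.array([420., 950., 300., 610.])   # response lengths in tokens
entropies = np.array([1.2, 0.6, 1.5, 0.9])     # mean per-token entropies

adv = grpo_advantages(rewards) * fgo_weights(lengths, entropies)
print(adv)  # weighted advantages fed into the policy-gradient update
```

Under this assumed scheme, a correct but short, high-entropy response receives a larger effective advantage than an equally correct but verbose, low-entropy one, which is the pressure that drives CoT compression.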
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | Accuracy | 73.2 | 381 |
| Mathematical Reasoning | AMC23 (test) | Pass@1 | 55 | 36 |
| Mathematical Reasoning | Minerva (test) | Accuracy | 24.6 | 12 |