
ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

About

Large reasoning models trained with reinforcement learning with verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning steps without performance gains. Existing trajectory-level length penalties often fail to shorten reasoning effectively and can degrade accuracy, as they treat all reasoning steps uniformly and lack the fine-grained signals needed to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. Leveraging the attention scores of these heads, we then employ two sub-strategies: discouraging redundant steps to mitigate overthinking, and reducing penalties on essential steps to preserve accuracy. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
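The core idea of attention-guided step-level credit assignment can be illustrated with a minimal sketch. The function below assumes that attention mass from the selected heads has already been aggregated into one score per reasoning step; the aggregation, the head-selection procedure, and the exact penalty schedule are hypothetical stand-ins, not the paper's actual implementation. Steps that the selected heads attend to strongly (essential steps) receive a reduced length penalty, while weakly attended (redundant) steps are penalized more.

```python
import numpy as np

def step_level_penalties(attn_scores, base_penalty=1.0):
    """Attention-guided per-step penalty sketch (hypothetical aggregation).

    attn_scores: per-step attention mass placed by the selected heads
    on each reasoning step, shape (num_steps,).
    Returns one penalty per step: the most-attended step gets zero
    penalty; the less a step is attended to, the closer its penalty
    gets to base_penalty.
    """
    scores = np.asarray(attn_scores, dtype=float)
    # Normalize attention shares over steps.
    weights = scores / scores.sum()
    # Scale each step's penalty down by its attention share relative
    # to the most-attended (most "essential") step.
    return base_penalty * (1.0 - weights / weights.max())

# Example: three steps; the middle step draws most of the attention,
# so it is treated as essential and penalized least.
penalties = step_level_penalties([0.1, 0.8, 0.1])
```

In an RLVR loop, these per-step penalties would be subtracted from (or used to reweight) the step-level reward, so that shortening pressure falls on redundant steps rather than uniformly on the whole trajectory.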

Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang, Yao Chen, Tingwen Liu, Weichong Yin, Yu Sun, Hua Wu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Math Reasoning | GSM8K | Accuracy | 92.4 | 126 |
| Code Reasoning | LiveCodeBench | Accuracy | 52.3 | 46 |
| Math Reasoning | MATH 500 | Accuracy | 92.8 | 38 |
| Math Reasoning | AIME 2024 | Accuracy | 0.572 | 37 |
| Math Reasoning | AIME 2025 | Accuracy | 38.1 | 33 |
| Math Reasoning | AMC 2023 | Accuracy | 89.6 | 26 |
| Math Reasoning | OlympiadBench | Accuracy | 68.7 | 22 |
| Science Reasoning | MMLU | Accuracy | 69.8 | 6 |
