Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

About

Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regulator rather than a core optimization driver. To fully leverage the potential of entropy and achieve fine-grained regulation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that continuously adapts optimization dynamics based on token-level entropy throughout the entire training process. Our algorithm includes four key components: (1) Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens. (2) Token-Level Group Average Advantage Estimation, which estimates advantages at the token level, accounting for sequence-length effects while preserving unbiased treatment. (3) Differential Advantage Redistribution, which leverages entropy and importance ratios to adjust advantages for tokens with clear signals. (4) Asymmetric Adaptive Clipping, which dynamically adjusts clipping boundaries based on token-level entropy. Through systematic investigation of entropy, we embed token-level treatment into every stage. Extensive experiments on mathematical reasoning, code, and logic tasks across multiple models show that HAPO consistently outperforms DAPO. Our code is available at https://github.com/starriver030515/HAPO.
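As a rough illustration of what "token-level entropy" means here, the sketch below computes the Shannon entropy of a next-token distribution and feeds it into two hypothetical rules: one mapping entropy to a sampling temperature, and one widening the upper clipping bound for high-entropy tokens. The abstract does not give HAPO's formulas, so the function names, the linear mappings, and all default values (t_low, t_high, eps_low, eps_high, widen) are assumptions made for illustration only, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(logits):
    """Entropy divided by its maximum, log(vocab size), clamped to [0, 1]."""
    max_entropy = math.log(len(logits))
    if max_entropy == 0.0:
        return 0.0
    return min(max(token_entropy(logits) / max_entropy, 0.0), 1.0)


def adaptive_temperature(logits, t_low=0.8, t_high=1.2):
    """Hypothetical rule: interpolate the temperature linearly with entropy,
    so high-entropy tokens are sampled at a higher temperature."""
    return t_low + (t_high - t_low) * normalized_entropy(logits)


def asymmetric_clip_bounds(logits, eps_low=0.2, eps_high=0.28, widen=0.1):
    """Hypothetical rule: keep the lower bound on the importance ratio fixed
    and widen only the upper bound as entropy grows."""
    upper = 1.0 + eps_high + widen * normalized_entropy(logits)
    return 1.0 - eps_low, upper


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]  # one dominant token, low entropy
    uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform distribution, maximal entropy
    for name, logits in (("confident", confident), ("uncertain", uncertain)):
        print(name, token_entropy(logits),
              adaptive_temperature(logits), asymmetric_clip_bounds(logits))
```

In this toy setup the uniform distribution reaches the maximum entropy log(4) and therefore receives the highest temperature and the widest upper clipping bound, while the sharply peaked distribution stays close to the lower settings. The rules actually used by HAPO are defined in the paper and the linked repository.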

Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH 500	Accuracy (Acc)	92.47	600
Mathematical Reasoning	AIME 2024	Accuracy	49.74	525
Mathematical Reasoning	AMC	Accuracy (%)	89.66	375
Mathematical Reasoning	AIME 2025	Accuracy	41.71	353
Mathematical Problem Solving	AIME 2024	Accuracy	19.17	113
Mathematical Reasoning	Math Benchmarks Aggregate	Accuracy (Avg)	27.42	62
Mathematical Reasoning	Minerva 272	Accuracy (Minerva 272)	42.53	28
Mathematical Reasoning	OlympiadBench 675	Accuracy	56.43	28
E-commerce product search and purchase	Webshop	Strict Success	22.7	27
Spatial planning	Sokoban	Success Rate	26.2	19

Showing 10 of 25 rows
