
Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

About

Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regulator rather than a core optimization driver. To fully leverage the potential of entropy and achieve fine-grained regulation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that continuously adapts optimization dynamics based on token-level entropy throughout the entire training process. Our algorithm includes four key components: (1) Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens; (2) Token-Level Group Average Advantage Estimation, which estimates advantages at the token level, accounting for sequence-length effects while preserving non-biased treatment; (3) Differential Advantage Redistribution, which leverages entropy and importance ratios to adjust advantages for tokens with clear signals; and (4) Asymmetric Adaptive Clipping, which dynamically adjusts clipping boundaries based on token-level entropy. Through systematic investigation of entropy, we embed token-level treatment into every stage. Extensive experiments on mathematical reasoning, code, and logic tasks across multiple models demonstrate HAPO's consistent superiority over DAPO. Our code can be found at https://github.com/starriver030515/HAPO.
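To make the role of token-level entropy concrete, here is a minimal Python sketch of the idea behind component (1). It is not the authors' implementation: the abstract gives no formulas, so the threshold rule, the function names, and the values `t_low` and `t_high` are all assumptions chosen for illustration. It only shows how the entropy of a next-token distribution could be measured and used to pick a higher sampling temperature at high-entropy tokens.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_temperature(entropy, threshold, t_low=0.8, t_high=1.2):
    """Hypothetical rule (an assumption, not taken from the paper):
    sample with a higher temperature when the token is high-entropy."""
    return t_high if entropy > threshold else t_low


# Illustrative usage with made-up logits over a four-token vocabulary.
uncertain_logits = [0.0, 0.0, 0.0, 0.0]   # uniform distribution
confident_logits = [10.0, 0.0, 0.0, 0.0]  # sharply peaked distribution

for logits in (uncertain_logits, confident_logits):
    h = token_entropy(logits)
    t = adaptive_temperature(h, threshold=1.0)
    sampling_probs = softmax(logits, temperature=t)
```

With this rule the uniform distribution, whose entropy is ln 4, falls above the illustrative threshold and is sampled at `t_high`, while the peaked distribution is sampled at `t_low`. A continuous mapping from entropy to temperature would fit the abstract's description equally well.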

Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 92.47 | 543 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 49.74 | 479 |
| Mathematical Reasoning | AMC | Accuracy (%) | 89.66 | 368 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 41.71 | 311 |
| Mathematical Problem Solving | AIME 2024 | Accuracy | 19.17 | 113 |
| Mathematical Reasoning | Math Benchmarks Aggregate | Accuracy (Avg) | 27.42 | 62 |
| Mathematical Reasoning | Minerva 272 | Accuracy (Minerva 272) | 42.53 | 28 |
| Mathematical Reasoning | OlympiadBench 675 | Accuracy | 56.43 | 28 |
| E-commerce product search and purchase | Webshop | Strict Success | 22.7 | 19 |
| Spatial planning | Sokoban | Success Rate | 26.2 | 19 |
Showing 10 of 25 rows
