LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
About
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
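The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 30th/70th percentile band, the use of the median as a target, and the wording of the hint are all assumptions made for illustration. The abstract only states that the first stage discovers the distribution of successful solution lengths and that the second stage embeds this information in the model's reasoning context.

```python
from statistics import median, quantiles


def summarize_successful_lengths(rollouts):
    """Summarize token counts of correct rollouts for a single problem.

    rollouts: list of (token_count, is_correct) pairs.
    Returns None when no rollout was correct.
    The 30th/70th percentile band is an assumed choice, not taken from the source.
    """
    lengths = sorted(n for n, ok in rollouts if ok)
    if not lengths:
        return None
    if len(lengths) == 1:
        # quantiles() needs at least two data points
        return {"low": lengths[0], "target": lengths[0], "high": lengths[0]}
    deciles = quantiles(lengths, n=10, method="inclusive")
    return {"low": deciles[2], "target": median(lengths), "high": deciles[6]}


def add_length_guidance(problem, stats):
    """Prepend a length hint to the problem (hypothetical prompt wording)."""
    if stats is None:
        return problem
    hint = f"Length hint: aim for roughly {int(stats['target'])} reasoning tokens."
    return f"{hint}\n\n{problem}"


# Stage 1 (sketch): gather length statistics from sampled rollouts.
rollouts = [(100, True), (200, True), (300, True), (5000, False)]
stats = summarize_successful_lengths(rollouts)

# Stage 2 (sketch): place the statistic in the model's context.
prompt = add_length_guidance("Compute 17 * 23.", stats)
```

Because the hint lives in the prompt rather than in a hard cutoff, the model remains free to exceed or undershoot it at inference time, which is the flexibility the abstract refers to.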
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 24 | Accuracy | 38.1 | 318 |
| Mathematical Reasoning | OlympiadBench | Accuracy | 56.3 | 213 |
| Mathematical Reasoning | MATH500 | Accuracy | 86.6 | 86 |
| General Reasoning | General Reasoning Suite Average | Pass@1 | 37.77 | 63 |
| Math Reasoning | MATH | Pass@1 | 86.04 | 18 |
| Logical Reasoning Question Answering | LSAT | Pass@1 | 0.2842 | 17 |
| Mathematical Reasoning | AMC 23 | Accuracy | 78.3 | 11 |
| Math Reasoning | AIME 2025 | Pass@1 | 27.92 | 6 |
| General Reasoning | GPQA Diamond | Pass@1 Accuracy | 36.17 | 6 |
| Math Reasoning | OlympiadBench | Pass@1 | 48.18 | 6 |