ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

About

Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to 3× higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
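The two mechanisms and the accuracy-efficiency ratio can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the discount value, nor the exact conciseness threshold, nor the precise normalization formula, so `discount`, `slack`, the division by the correct count, and all class and function names below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperienceTracker:
    """Remembers the shortest correct response length seen per problem."""
    discount: float = 0.5   # ASSUMPTION: credit for verbose correct responses
    slack: float = 1.0      # ASSUMPTION: "concise" means length <= slack * best
    best_len: Dict[str, int] = field(default_factory=dict)

    def reward(self, problem_id: str, length: int, correct: bool) -> float:
        """Three-tier reward: 0 if incorrect, 1 if concise, discount if verbose."""
        if not correct:
            return 0.0
        best = self.best_len.get(problem_id)
        if best is None or length <= self.slack * best:
            return 1.0
        return self.discount

    def update(self, problem_id: str, lengths: List[int], correct: List[bool]) -> None:
        """Tighten the threshold if this batch holds a shorter correct response."""
        correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
        if not correct_lengths:
            return
        shortest = min(correct_lengths)
        prev = self.best_len.get(problem_id)
        self.best_len[problem_id] = shortest if prev is None else min(prev, shortest)


def difficulty_adaptive_advantage(rewards: List[float], correct: List[bool]) -> List[float]:
    """ASSUMED form: center rewards, then divide by the number of correct samples.

    Fewer correct samples (a harder problem) gives a smaller divisor and
    therefore larger advantages; many correct samples shrink them.
    """
    mean_r = sum(rewards) / len(rewards)
    scale = max(sum(correct), 1)  # avoid division by zero when nothing is correct
    return [(r - mean_r) / scale for r in rewards]


def accuracy_efficiency_ratio(accuracy: float, avg_tokens: float) -> float:
    """Accuracy divided by average token count, as defined in the abstract."""
    return accuracy / avg_tokens
```

In this sketch, a problem with one correct sample out of four produces larger-magnitude advantages than a problem where all four samples are correct, which is the qualitative behavior the abstract describes. The order in which `reward` and `update` are called within a training step is also an assumption.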

Tingcheng Bian, Yuzhe Zhang, Jing Jin, Jinchang Luo, MingQuan Cheng, Haiwei Wang, Wenyuan Jiang, Miaohui Wang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | AMC23 | Pass@1 Accuracy | 82.4 | 207
Mathematical Reasoning | AIME24 | Pass@1 Accuracy | 65.6 | 117
Code Reasoning | LiveCodeBench | Accuracy | 20.73 | 90
Mathematical Reasoning | Minerva | Pass@1 Accuracy | 45.5 | 52
Scientific Reasoning | GPQA Diamond | Accuracy | 47.2 | 41
Mathematical Reasoning | Aggregate AMC23, AIME24, MATH-500, Minerva, Olympiad | Intelligence Per Token (IPT) | 51.04 | 30
Mathematical Reasoning | MATH-500 | Pass@1 Accuracy | 94.2 | 16
Mathematical Reasoning | OlympiadBench | Pass@1 Accuracy | 62.7 | 16
Out-of-domain Generalization | LiveCodeBench, GPQA-Diamond, MMLU Average | IPT | 55.39 | 14
Multitask Language Understanding | MMLU | Accuracy | 65.1 | 14