Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

About

Large language models (LLMs) excel at complex tasks thanks to advances in their reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and efficiency, often encouraging unnecessarily long reasoning chains that waste tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework that lets LLMs achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward that quantifies the episode-wise information gain in the model parameters and requires no extra annotations or task-specific evaluators. We propose a method to quickly estimate this reward based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that this estimator significantly reduces computational complexity while retaining high estimation accuracy. By immediately rewarding each episode's contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning so that every episode is used as fully as possible and the resulting updates are effective. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, improving both reasoning effectiveness and efficiency.

Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong • 2025
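
The abstract mentions scoring each episode by the information gain in the parameters, estimated with the help of the Fisher information matrix. As a rough illustration only, the sketch below uses the standard second-order approximation KL ≈ ½ Δθᵀ F Δθ with a diagonal empirical Fisher to assign a number to each episode's parameter change. This is not the authors' estimator: the function names (`diagonal_fisher`, `quadratic_kl`, `episode_scores`), the diagonal approximation, and the toy numbers are all assumptions made for the example, and the PAC-Bayes part of the method is not modeled.

```python
from typing import List, Sequence


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-sample gradients.

    Assumption: each inner sequence is the gradient of a log-likelihood
    for one sample, flattened to a vector of equal length.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def quadratic_kl(delta: Sequence[float], fisher_diag: Sequence[float]) -> float:
    """Second-order approximation: KL ~ 0.5 * sum_i F_ii * delta_i ** 2."""
    return 0.5 * sum(f * d * d for f, d in zip(fisher_diag, delta))


def episode_scores(
    deltas: Sequence[Sequence[float]], fisher_diag: Sequence[float]
) -> List[float]:
    """One score per episode; a larger score means a larger Fisher-weighted change."""
    return [quadratic_kl(d, fisher_diag) for d in deltas]


# Toy data (hypothetical): two samples, two parameters.
grads = [[1.0, 0.0], [1.0, 2.0]]
fisher = diagonal_fisher(grads)

# Hypothetical per-episode parameter changes; the last episode changes nothing.
deltas = [[0.2, 0.0], [0.0, 0.1], [0.0, 0.0]]
scores = episode_scores(deltas, fisher)
print(fisher, scores)
```

In this toy setup an episode that leaves the parameters unchanged scores zero, which loosely mirrors the idea of not rewarding episodes that contribute nothing. How the paper combines such a gain term with its penalty on excessive updates is not shown here.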

Related benchmarks

Task                    Dataset                    Result                Rank
Mathematical Reasoning  GSM8K                      pass@1 92.9           102
Mathematical Reasoning  AIME 2025                  Pass@1 53.6           96
Mathematical Reasoning  AIME 2024                  Pass@1 58.4           86
Mathematical Reasoning  Minerva Math               pass@1 Accuracy 45    82
Mathematical Reasoning  Math Benchmarks Aggregate  Pass@1 71.1           44
Mathematical Reasoning  AMC 2023                   Pass@1 87.5           30
