Revisiting LLM Reasoning via Information Bottleneck
About
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). By leveraging simple rule-based rewards, RL effectively incentivizes LLMs to produce extended chain-of-thought (CoT) reasoning trajectories, progressively guiding them toward correct answers. However, existing approaches remain largely heuristic and intuition-driven, limiting the development of principled methodologies. In this paper, we present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle and introduce IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable across diverse prompts. We derive a practical token-level surrogate objective and propose an efficient approximation, resulting in a lightweight IB regularization method. This technique integrates seamlessly into existing RL-based post-training frameworks without additional computational overhead, requiring only a one-line code modification. Empirically, we validate IB regularization across multiple mathematical reasoning benchmarks and RL algorithms, demonstrating consistent improvements in LLM reasoning performance.
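For orientation, the two goals named in the abstract can be read against the classical IB Lagrangian. The block below is a sketch of that generic form only, not the paper's objective: the symbols $q$ (prompt), $r$ (reasoning trajectory), $a$ (correct answer), the policy $\pi_\theta$, and the trade-off weight $\beta$ are notation assumed here for illustration, and the abstract does not state the exact IBRO objective or its token-level surrogate.

```latex
% Generic information bottleneck trade-off, written as a maximization.
% Assumed notation: q = prompt, r = reasoning trajectory, a = correct answer.
\max_{\pi_\theta} \; I(r; a) \;-\; \beta \, I(q; r), \qquad \beta > 0
```

Under this reading, the first term rewards trajectories that are informative about the correct answer, while the second penalizes dependence on the specifics of an individual prompt, which is one way to formalize generalization across diverse prompts.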
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 25 | Accuracy | 15.7 | 112 |
| Instruction Following | IFEval | Accuracy (IFEval) | 54.3 | 89 |
| Science Reasoning | GPQA | Accuracy (GPQA) | 44.7 | 72 |
| Mathematics | AIME 25 | Avg@32 | 14.5 | 20 |
| Mathematics | AIME 24 | Avg@32 | 0.169 | 20 |
| Comprehensive Evaluation | Overall Across Benchmarks | Avg@32 Accuracy | 41.6 | 16 |
| Instruction | IFEval | Avg@32 Accuracy | 44.7 | 16 |
| Mathematics | MATH 500 | Accuracy (avg@32) | 82 | 16 |
| Mathematics | AMC 23 | Avg@32 Accuracy | 55.3 | 16 |
| Mathematics | AMC 24 | Accuracy (avg@32) | 39.5 | 16 |