PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching
About
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. Casting the GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier on creative tasks.
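To make the objective concrete, below is a minimal PyTorch sketch of a length-aware Trajectory-Balance loss against an $\alpha$-power target. It assumes the unnormalized log-reward is $\alpha \log p_{\mathrm{ref}}(x)$, that the backward policy is deterministic for autoregressive generation (so $\log P_B = 0$), and that the length correction is a per-token normalization of the TB residual. The function name and the exact correction form are illustrative assumptions, not the paper's verbatim objective.

```python
import torch

def length_aware_tb_loss(logp_policy, logp_ref, lengths, log_z, alpha=2.0):
    """Sketch of a length-aware Trajectory-Balance (TB) loss.

    Targets the alpha-power distribution pi(x) ∝ p_ref(x)^alpha:
    alpha > 1 sharpens the distribution (reasoning), alpha < 1
    flattens it (creativity).

    Args:
        logp_policy: (B,) summed log-probs of each sampled sequence under
            the policy being trained (the GFlowNet forward policy).
        logp_ref: (B,) summed log-probs of the same sequences under the
            frozen reference model; alpha * logp_ref is the log-reward.
        lengths: (B,) float token counts, used to neutralize the structural
            length bias of autoregressive generation (assumed form).
        log_z: scalar learnable estimate of the log partition function.
        alpha: power of the target distribution.
    """
    # TB residual: log Z + log P_F(x) - log R(x). The backward policy is
    # deterministic for left-to-right generation, so log P_B = 0.
    residual = log_z + logp_policy - alpha * logp_ref
    # Per-token normalization keeps long sequences from dominating the
    # squared residual.
    return (residual / lengths).pow(2).mean()

# Toy usage with batch size 2 (all values illustrative):
logp_pi  = torch.tensor([-35.2, -58.1])   # policy log-probs of samples
logp_ref = torch.tensor([-30.0, -55.7])   # reference log-probs
lens     = torch.tensor([24.0, 41.0])     # sequence lengths
log_z    = torch.tensor(0.0, requires_grad=True)
loss = length_aware_tb_loss(logp_pi, logp_ref, lens, log_z, alpha=1.5)
loss.backward()  # gradients reach log Z (and the policy when logp_pi is differentiable)
```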
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy@16 | 20 | 36 |
| Joke Generation | Joke | Quality | 0.8791 | 29 |
| Mathematical Reasoning | AIME 2025 | Avg@16 | 14.4 | 28 |
| Mathematical Reasoning | AMC 2023 | Avg@16 Score | 63.4 | 28 |
| Scientific Reasoning | GPQA | Avg@16 | 37 | 28 |
| Creative Writing | Overall Average (Poem, Joke, Story) | Semantic Diversity | 0.3311 | 20 |
| Creative Writing | Story | Semantic Diversity | 29.94 | 20 |
| Creative Writing | Poem | Semantic Diversity | 24.92 | 20 |