PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching
About
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. Casting the GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier on creative tasks.
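To make the objective concrete, below is a minimal PyTorch sketch of a length-aware Trajectory-Balance loss against an $\alpha$-power target. It assumes the unnormalized log-reward is $\alpha \log p_{\mathrm{ref}}(x)$, that the backward policy is deterministic for autoregressive generation (so $\log P_B = 0$), and that the length correction is a per-token normalization of the TB residual. The function name and the exact correction form are illustrative assumptions, not the paper's verbatim objective.

```python
import torch

def length_aware_tb_loss(logp_policy, logp_ref, lengths, log_z, alpha=2.0):
    """Sketch of a length-aware Trajectory-Balance (TB) loss.

    Targets the alpha-power distribution pi(x) ∝ p_ref(x)^alpha:
    alpha > 1 sharpens the distribution (reasoning), alpha < 1
    flattens it (creativity).

    Args:
        logp_policy: (B,) summed log-probs of each sampled sequence under
            the policy being trained (the GFlowNet forward policy).
        logp_ref: (B,) summed log-probs of the same sequences under the
            frozen reference model; alpha * logp_ref is the log-reward.
        lengths: (B,) float token counts, used to neutralize the structural
            length bias of autoregressive generation (assumed form).
        log_z: scalar learnable estimate of the log partition function.
        alpha: power of the target distribution.
    """
    # TB residual: log Z + log P_F(x) - log R(x). The backward policy is
    # deterministic for left-to-right generation, so log P_B = 0.
    residual = log_z + logp_policy - alpha * logp_ref
    # Per-token normalization keeps long sequences from dominating the
    # squared residual.
    return (residual / lengths).pow(2).mean()

# Toy usage with batch size 2 (all values illustrative):
logp_pi  = torch.tensor([-35.2, -58.1])   # policy log-probs of samples
logp_ref = torch.tensor([-30.0, -55.7])   # reference log-probs
lens     = torch.tensor([24.0, 41.0])     # sequence lengths
log_z    = torch.tensor(0.0, requires_grad=True)
loss = length_aware_tb_loss(logp_pi, logp_ref, lens, log_z, alpha=1.5)
loss.backward()  # gradients reach log Z (and the policy when logp_pi is differentiable)
```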
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy@16 | 20 | 36 |
| Joke Generation | Joke | Quality | 0.8791 | 29 |
| Mathematical Reasoning | AIME 2025 | Avg@16 | 14.4 | 28 |
| Mathematical Reasoning | AMC 2023 | Avg@16 Score | 63.4 | 28 |
| Scientific Reasoning | GPQA | Avg@16 | 37 | 28 |
| Creative Writing | Overall Average (Poem, Joke, Story) | Semantic Diversity | 0.3311 | 20 |
| Creative Writing | Story | Semantic Diversity | 29.94 | 20 |
| Creative Writing | Poem | Semantic Diversity | 24.92 | 20 |