PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

About

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. Casting GFlowNets as amortized variational samplers for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier outward on creative tasks.

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang • 2026
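
To make the objective concrete, below is a minimal sketch of what a length-aware Trajectory-Balance loss targeting an $\alpha$-power distribution could look like in PyTorch. It assumes the unnormalized target density is the base model's likelihood raised to the power $\alpha$, i.e. $R(x) = p_{\text{ref}}(x)^{\alpha}$, and that the backward policy is trivial for left-to-right generation, so the TB residual per sequence reduces to $\log Z + \log \pi_\theta(x) - \alpha \log p_{\text{ref}}(x)$. Dividing the residual by sequence length is one plausible way to neutralize length bias; the function and variable names here are illustrative, not the paper's implementation.

```python
import torch

def length_aware_tb_loss(logp_theta: torch.Tensor,
                         logp_ref: torch.Tensor,
                         lengths: torch.Tensor,
                         log_z: torch.Tensor,
                         alpha: float = 2.0) -> torch.Tensor:
    """Hypothetical length-aware Trajectory-Balance loss.

    Unnormalized target density: R(x) = p_ref(x)^alpha.
    With a deterministic backward policy (one trajectory per sequence),
    the TB residual for a sequence x is
        delta(x) = log Z + log pi_theta(x) - alpha * log p_ref(x).
    Normalizing delta by the token count |x| before squaring is one way
    to keep long sequences from dominating the loss (an assumption, not
    necessarily the paper's exact correction).
    """
    residual = log_z + logp_theta - alpha * logp_ref
    return ((residual / lengths) ** 2).mean()

# Toy usage with per-sequence summed log-probabilities for 2 samples.
logp_theta = torch.tensor([-35.0, -52.0])      # log pi_theta(x) under the sampler
logp_ref = torch.tensor([-40.0, -60.0])        # log p_ref(x) under the base model
lengths = torch.tensor([20.0, 30.0])           # token counts |x|
log_z = torch.tensor(0.0, requires_grad=True)  # learnable log-partition estimate
loss = length_aware_tb_loss(logp_theta, logp_ref, lengths, log_z, alpha=2.0)
```

Setting `alpha > 1` in this sketch pushes the sampler toward the sharpened target (reasoning), while `alpha < 1` targets the flattened one (creativity), matching the dual-use framing in the abstract.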

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy @16 | 20 | 36 |
| Joke generation | Joke | Quality | 0.8791 | 29 |
| Mathematical Reasoning | AIME 2025 | Avg@16 | 14.4 | 28 |
| Mathematical Reasoning | AMC 2023 | Avg@16 Score | 63.4 | 28 |
| Scientific Reasoning | GPQA | Avg@16 | 37 | 28 |
| Creative Writing | Overall Average (Poem, Joke, Story) | Semantic Diversity | 0.3311 | 20 |
| Creative Writing | Story | Semantic Diversity | 29.94 | 20 |
| Creative Writing | Poem | Semantic Diversity | 24.92 | 20 |
