
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

About

Reinforcement learning enhances the reasoning capabilities of large language models but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a plausible remedy: prioritizing informative prompts improves training efficiency. However, current methods either depend on costly, exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Its batch acquisition principle combines intermediate-difficulty prioritization with history-anchored diversity to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient computational allocation. Experiments across diverse reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baselines.
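The batch-acquisition principle described above can be sketched in code. The snippet below is an illustrative toy version, not the authors' implementation: it scores each prompt by the Bernoulli entropy of its predicted pass rate (which peaks at 0.5, i.e. intermediate difficulty) and greedily subtracts a diversity penalty measured against history-anchor embeddings and already-selected prompts. All function names, the cosine-similarity penalty, and the `lam` trade-off weight are assumptions for illustration.

```python
import math

def difficulty_score(p):
    # Bernoulli entropy of predicted pass rate: maximal at p = 0.5,
    # so intermediate-difficulty prompts score highest.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_batch(pred_pass, embeddings, anchors, batch_size, lam=0.5):
    """Greedy batch acquisition (toy version): entropy of the predicted
    pass rate minus lam times the maximum similarity to history anchors
    and previously chosen prompts."""
    chosen = []
    for _ in range(batch_size):
        best, best_score = None, -float("inf")
        for i, p in enumerate(pred_pass):
            if i in chosen:
                continue
            ref = anchors + [embeddings[j] for j in chosen]
            penalty = max((cosine(embeddings[i], e) for e in ref), default=0.0)
            score = difficulty_score(p) - lam * penalty
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```

With `lam=1.0`, a prompt whose embedding duplicates an already-selected one is skipped in favor of a harder-but-novel prompt, which is the intended interplay between the difficulty and diversity terms.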

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | Pass@1 | 93.2 | 153 |
| General Reasoning | MMLU-Pro | Avg@8 Accuracy | 0.528 | 51 |
| Reasoning | ARC Challenge | - | - | 45 |
| Mathematical Reasoning | OlympiadBench | Pass@1 | 62.1 | 39 |
| Mathematical Reasoning | AIME 24 | Avg@32 Accuracy | 51.8 | 23 |
| General Reasoning | GPQA Diamond | Avg@8 Accuracy | 30.4 | 14 |
| Logical Reasoning | Countdown CD4 | Avg@16 | 59.4 | 14 |
| Logical Reasoning | Countdown CD34 | Avg@16 | 77.9 | 14 |
| Mathematical Reasoning | AMC23 | Avg@32 Accuracy | 90.5 | 14 |
