UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
About
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks. However, standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs in order to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages each trajectory to be specific to the latent skill variable z it is conditioned on. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
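To make the distinction between pass@1 and pass@k concrete, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled attempts of which c are correct. This is an illustration only, not the paper's evaluation code: the function names and the toy counts are assumptions chosen for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled attempts, c: number of correct attempts,
    k: attempt budget (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy data: two problems, 10 attempts each.
# "spread" solves each problem half of the time;
# "collapsed" always solves one problem and never solves the other.
spread = [5, 5]
collapsed = [10, 0]

for name, counts in (("spread", spread), ("collapsed", collapsed)):
    p1 = mean_pass_at_k(counts, n=10, k=1)
    p4 = mean_pass_at_k(counts, n=10, k=4)
    print(f"{name}: pass@1={p1:.3f} pass@4={p4:.3f}")
```

Both toy policies have the same pass@1 of 0.5, but the one whose successes are spread across problems reaches a pass@4 of about 0.976, while the collapsed one stays at 0.5. This is the gap that a diversity-oriented objective targets when the abstract speaks of improving pass@k without degrading pass@1.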
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Code Generation | CodeContests (test) | -- | 68 |
| Code Generation | APPS (test) | -- | 36 |
| Code Generation | APPS Introductory | -- | 25 |
| Code Generation | LiveCodeBench LCBv6 (held-out) | Pass@4: 54.2 | 24 |
| Code Generation | CodeContests official (val) | Pass@4: 13.6 | 24 |
| Code Generation | LiveCodeBench v6 (test) | Pass@4: 54.2 | 16 |
| Code Generation | LCB v6 (fixed 500-problem slice) | Pass@4: 41.1 | 6 |