UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

About

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks. However, standard approaches optimize single-attempt accuracy, which can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages each trajectory to be specific to its latent skill variable z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
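The abstract contrasts pass@1 with pass@k but does not say how pass@k is estimated. As a minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) computed from n sampled attempts of which c are correct, the Python below shows how two models with the same pass@1 can differ in pass@k. The function names and the toy correct-counts are hypothetical illustrations, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of them correct.

    Assumed estimator (not confirmed by the source): 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy data: 2 problems, 4 attempts each.
# Model A always solves problem 1 and never solves problem 2.
# Model B solves each problem in half of its attempts.
model_a = [4, 0]
model_b = [2, 2]

print(mean_pass_at_k(model_a, n=4, k=1), mean_pass_at_k(model_a, n=4, k=4))
print(mean_pass_at_k(model_b, n=4, k=1), mean_pass_at_k(model_b, n=4, k=4))
```

In this toy case both models have a mean pass@1 of 0.5, but only the model whose successes are spread across problems reaches a mean pass@4 of 1.0, which is the kind of gap a diversity-oriented objective aims to close.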

Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach • 2026

Related benchmarks

Task             Dataset                              Result        Rank
Code Generation  CodeContests (test)                  --            68
Code Generation  APPS (test)                          --            36
Code Generation  APPS Introductory                    --            25
Code Generation  LiveCodeBench LCBv6 (held-out)       Pass@4 54.2   24
Code Generation  CodeContests official (val)          Pass@4 13.6   24
Code Generation  LiveCodeBench v6 (test)              Pass@4 54.2   16
Code Generation  LCB v6 (fixed 500-problem slice)     Pass@4 41.1   6
