UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

About

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks. However, standard approaches optimize single-attempt accuracy, which can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs in order to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages each trajectory to be specific to a latent skill variable z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
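To make the pass@1 versus pass@k distinction concrete, the sketch below uses the standard unbiased pass@k estimator that is common in code-generation evaluation. The abstract does not say which estimator the authors use, so this is an assumption; the function names and the two toy policies are hypothetical and exist only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled responses, c: number of correct responses,
    k: attempt budget. Returns the probability that at least one of
    k responses drawn without replacement from the n samples is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: two problems, 8 samples each.
# The "collapsed" policy always solves problem 1 and never solves problem 2;
# the "diverse" policy solves each problem in half of its samples.
collapsed = [(8, 8), (8, 0)]
diverse = [(8, 4), (8, 4)]

for name, results in [("collapsed", collapsed), ("diverse", diverse)]:
    print(name, mean_pass_at_k(results, 1), mean_pass_at_k(results, 4))
```

Both toy policies have the same pass@1, but only the diverse one benefits from extra attempts. That gap is the kind of multi-attempt headroom the abstract says single-attempt optimization can suppress.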

Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach • 2026

Related benchmarks

Task	Dataset	Result
Code Generation	CodeContests (test)	--	68
Code Generation	APPS (test)	--	36
Code Generation	APPS Introductory	--	25
Code Generation	LiveCodeBench LCBv6 (held-out)	Pass@4 54.2	24
Code Generation	CodeContests official (val)	Pass@4 13.6	24
Code Generation	LiveCodeBench v6 (test)	Pass@4 54.2	16
Code Generation	LCB v6 (fixed 500-problem slice)	Pass@4 41.1	6

