Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

About

Language models often struggle to follow multi-constraint instructions, which are crucial for real-world applications. Existing reinforcement learning (RL) approaches depend on external supervision and suffer from sparse reward signals in multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates the dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse rewards while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction-following benchmarks. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
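The sparse-reward problem the abstract refers to can be illustrated by contrasting an all-or-nothing reward with a constraint-wise one. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `Constraint` class and its `check` field are hypothetical, the rule-based lambdas stand in for the trained per-constraint binary classifier, and averaging the per-constraint verdicts is an assumed aggregation rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """Hypothetical: one constraint extracted from an instruction."""
    description: str
    # Stand-in for a per-constraint binary classifier (True = satisfied).
    check: Callable[[str], bool]


def all_or_nothing_reward(response: str, constraints: List[Constraint]) -> float:
    """Sparse reward: 1.0 only if every constraint is satisfied."""
    if not constraints:
        return 0.0
    return float(all(c.check(response) for c in constraints))


def constraint_wise_reward(response: str, constraints: List[Constraint]) -> float:
    """Denser reward (assumed aggregation): fraction of satisfied constraints."""
    if not constraints:
        return 0.0
    verdicts = [c.check(response) for c in constraints]
    return sum(verdicts) / len(verdicts)


# Illustrative instruction with three constraints (hypothetical example).
constraints = [
    Constraint("fewer than 20 words", lambda r: len(r.split()) < 20),
    Constraint("mentions Paris", lambda r: "Paris" in r),
    Constraint("ends with a period", lambda r: r.rstrip().endswith(".")),
]

resp = "Paris is the capital of France"  # satisfies 2 of 3 constraints

print(all_or_nothing_reward(resp, constraints))   # 0.0
print(constraint_wise_reward(resp, constraints))  # 0.666...
```

Under the all-or-nothing reward, a response that satisfies two of three constraints receives the same signal as one that satisfies none; a constraint-wise reward still distinguishes partial progress, which is the intuition behind using per-constraint binary judgments.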

Qingyu Ren, Qianyu He, Powei Chang, Jie Zeng, Zeye Sun, Fei Yu, Jiaqing Liang, Yanghua Xiao • 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	FollowBench	HSR 57.5	85
Instruction Following	CF-Bench	Instruction Success Rate 52	68
Instruction Following	IFEval	--	65
Instruction Following	Multi-IF	Score 64.3	41
Instruction Following	AgentIF	CSR 56.7	29
Instruction Following	WritingBench	Average Score 58.5	29
Instruction Following	ComplexBench (Out-of-Domain)	Overall Score 79.8	23
Instruction Following	AgentIF (Out-of-Domain)	CSR 66.9	23
Instruction Following	IFEval (In-Domain)	Precision (L) 0.871	23
Instruction Following	CFBench (In-Domain)	ISR 68	23

