Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

About

Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if

Qingyu Ren, Qianyu He, Powei Chang, Jie Zeng, Zeye Sun, Fei Yu, Jiaqing Liang, Yanghua Xiao • 2025
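
The abstract's idea of constraint-wise binary classification can be illustrated with a minimal sketch. This is not the authors' implementation: the `Constraint` class, the `constraint_wise_reward` function, and the rule-based checkers are hypothetical stand-ins, and the mean aggregation is an assumption. In the paper's setting, a learned reward model trained on pseudo-labels would presumably take the place of these hand-written checks. The point is only that judging each decomposed constraint separately gives a denser signal than a single pass/fail on the whole instruction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """One atomic constraint split out of a multi-constraint instruction (hypothetical type)."""
    description: str
    is_satisfied: Callable[[str], bool]  # binary judgment on a response


def constraint_wise_reward(response: str, constraints: List[Constraint]) -> float:
    """Average the per-constraint binary judgments into a reward in [0, 1].

    Assumption: a plain mean is used for aggregation; the source does not
    specify how per-constraint judgments are combined.
    """
    if not constraints:
        return 0.0
    verdicts = [1.0 if c.is_satisfied(response) else 0.0 for c in constraints]
    return sum(verdicts) / len(verdicts)


# Hypothetical decomposition of the instruction:
# "Answer in under 10 words, mention 'RL', and end with a period."
constraints = [
    Constraint("under 10 words", lambda r: len(r.split()) < 10),
    Constraint("mentions 'RL'", lambda r: "RL" in r),
    Constraint("ends with a period", lambda r: r.rstrip().endswith(".")),
]

# Satisfies all three constraints.
full = constraint_wise_reward("RL optimizes policies from reward.", constraints)
# Satisfies only the length constraint, yet still receives partial credit.
partial = constraint_wise_reward("Policies are optimized from reward", constraints)
```

With an all-or-nothing reward, the second response would score zero; scoring each constraint separately lets it earn credit for the one constraint it meets.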

Related benchmarks

Task | Dataset | Result | Rank
Instruction Following | FollowBench | HSR 57.5 | 85
Instruction Following | CF-Bench | Instruction Success Rate 52 | 68
Instruction Following | IFEval | -- | 65
Instruction Following | Multi-IF | Score 64.3 | 41
Instruction Following | AgentIF | CSR 56.7 | 29
Instruction Following | WritingBench | Average Score 58.5 | 29
Instruction Following | ComplexBench Out-of-Domain | Overall Score 79.8 | 23
Instruction Following | AgentIF (Out-of-Domain) | CSR 66.9 | 23
Instruction Following | IFEval In-Domain | Precision (L) 0.871 | 23
Instruction Following | CFBench In-Domain | ISR 68 | 23
Showing 10 of 14 rows
