Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
About
Language models often struggle to follow multi-constraint instructions, which are crucial for real-world applications. Existing reinforcement learning (RL) approaches depend on external supervision and suffer from sparse reward signals in multi-constraint tasks. We propose a label-free, self-supervised RL framework that eliminates the dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address the sparse reward problem while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across three in-domain and five out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
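To illustrate why constraint-wise scoring gives a denser signal than a single pass/fail judgment, here is a minimal Python sketch. It is not the authors' implementation. The `Constraint` class, both reward functions, and the rule-based checkers are hypothetical stand-ins for the learned binary classifier described above, and averaging the per-constraint judgments is an assumed aggregation rule.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Constraint:
    """One decomposed constraint (hypothetical structure, for illustration only)."""
    description: str
    # Stand-in for a learned binary classifier: True if the response satisfies it.
    check: Callable[[str], bool]


def constraint_wise_reward(response: str, constraints: Sequence[Constraint]) -> float:
    """Assumed aggregation: fraction of constraints judged satisfied, in [0, 1]."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for c in constraints if c.check(response))
    return satisfied / len(constraints)


def all_or_nothing_reward(response: str, constraints: Sequence[Constraint]) -> float:
    """Sparse baseline: 1.0 only when every constraint is satisfied."""
    return 1.0 if all(c.check(response) for c in constraints) else 0.0


# Toy instruction decomposed into three constraints (illustrative rules only).
constraints = [
    Constraint("at most 10 words", lambda r: len(r.split()) <= 10),
    Constraint("mentions 'reward'", lambda r: "reward" in r.lower()),
    Constraint("ends with a period", lambda r: r.strip().endswith(".")),
]

response = "Dense reward signals help training"
dense = constraint_wise_reward(response, constraints)
sparse = all_or_nothing_reward(response, constraints)
print(f"constraint-wise: {dense:.3f}, all-or-nothing: {sparse:.1f}")
```

In this toy case the response meets two of the three constraints, so the constraint-wise score still credits partial progress while the all-or-nothing score does not.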
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Instruction Following | FollowBench | HSR | 57.5 | 85 |
| Instruction Following | CF-Bench | Instruction Success Rate | 52 | 68 |
| Instruction Following | IFEval | -- | -- | 65 |
| Instruction Following | Multi-IF | Score | 64.3 | 41 |
| Instruction Following | AgentIF | CSR | 56.7 | 29 |
| Instruction Following | WritingBench | Average Score | 58.5 | 29 |
| Instruction Following | ComplexBench Out-of-Domain | Overall Score | 79.8 | 23 |
| Instruction Following | AgentIF (Out-of-Domain) | CSR | 66.9 | 23 |
| Instruction Following | IFEval In-Domain | Precision (L) | 0.871 | 23 |
| Instruction Following | CFBench In-Domain | ISR | 68 | 23 |