MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
About
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed samples, where scalar rewards indicate only that a solution is incorrect without explaining why the reasoning breaks down. In this paper, we leverage richer verbal feedback to guide RLVR on failed samples and convert feedback-induced progress into trainable learning signals. We propose MulFeRL (Multi-turn Feedback-guided Reinforcement Learning), a multi-turn, event-triggered RLVR framework that combines progress induction for feedback-guided regeneration of failed samples, progress credit assignment for learning from verifier-confirmed progress, and structured feedback injection for integrating feedback into the model's reasoning process. Trained on a sampled subset of OpenR1-Math, MulFeRL outperforms supervised, self-distillation-based, and RLVR baselines in-domain, while also showing strong out-of-domain generalization.
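The abstract gives no implementation details, so the following is only a toy sketch of what an event-triggered, multi-turn feedback loop of this kind might look like. It is not the authors' method. Every name here (`feedback_loop`, `progress_rewards`, `Turn`, the bracketed injection template, and the binary reward rule) is a hypothetical stand-in, and the generator, verifier, and feedback source are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    response: str
    correct: bool   # verifier outcome for this response
    feedback: str   # feedback the response was conditioned on ("" on the first turn)


def feedback_loop(
    prompt: str,
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    give_feedback: Callable[[str, str], str],
    max_turns: int = 3,
) -> List[Turn]:
    """Regenerate a failed sample under verbal feedback until it passes or turns run out."""
    turns: List[Turn] = []
    context, feedback = prompt, ""
    for _ in range(max_turns):
        response = generate(context)
        correct = verify(response)
        turns.append(Turn(response, correct, feedback))
        if correct:
            break  # event trigger: feedback is requested only after a failure
        feedback = give_feedback(prompt, response)
        # Hypothetical injection template; the real format is not given in the abstract.
        context = f"{prompt}\n[Feedback on previous attempt]\n{feedback}"
    return turns


def progress_rewards(turns: List[Turn]) -> List[float]:
    """Toy credit rule (an assumption): 1.0 for a verifier-accepted turn, else 0.0."""
    return [1.0 if t.correct else 0.0 for t in turns]


# Toy stand-ins for the policy, the verifier, and the feedback provider.
def toy_generate(context: str) -> str:
    return "4" if "Feedback" in context else "5"


def toy_verify(response: str) -> bool:
    return response == "4"


def toy_feedback(prompt: str, response: str) -> str:
    return "Recheck the addition."


turns = feedback_loop("What is 2 + 2?", toy_generate, toy_verify, toy_feedback)
rewards = progress_rewards(turns)
```

In this toy run the first attempt fails, feedback is injected, and the second attempt passes, so only the feedback-guided turn receives credit. A real system would replace the binary rule with whatever credit assignment the paper defines.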
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1 | 90.24 | 239 |
| Mathematical Reasoning | Minerva | Pass@1 | 55.96 | 138 |
| Mathematical Reasoning | Olympiad Bench | Pass@1 Accuracy | 68.13 | 115 |
| Mathematical Reasoning | AMC23 | Pass@1 | 90.3 | 43 |
| Mathematical Reasoning | AIME 24 | Pass@1 | 68.13 | 39 |
| Scientific and General Reasoning | MMLU-Pro | Pass@1 | 68.08 | 21 |
| Scientific and General Reasoning | GPQA Diamond | Pass@1 | 50.2 | 21 |
| Scientific and General Reasoning | Theorem QA | Pass@1 | 55.75 | 18 |
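
For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute Pass@1 as a percentage: average, over problems, the fraction of sampled answers the verifier accepts. The page does not say how many samples per problem were drawn or how the scores were aggregated, so the function name `pass_at_1` and the averaging scheme are assumptions, not the evaluation protocol behind the table.

```python
from typing import Sequence


def pass_at_1(per_problem_correct: Sequence[Sequence[bool]]) -> float:
    """Mean over problems of the fraction of correct samples, as a percentage.

    Assumes every problem has at least one sampled answer.
    """
    if not per_problem_correct:
        raise ValueError("need at least one problem")
    per_problem = [sum(samples) / len(samples) for samples in per_problem_correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Four hypothetical problems with two sampled answers each.
score = pass_at_1([[True, False], [True, True], [False, False], [True, True]])
```

With a single sample per problem this reduces to plain accuracy.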