
MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

About

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed samples, where scalar rewards indicate only that a solution is incorrect without explaining why the reasoning breaks down. In this paper, we leverage richer verbal feedback to guide RLVR on failed samples and convert feedback-induced progress into trainable learning signals. We propose MulFeRL (Multi-turn Feedback-guided Reinforcement Learning), a multi-turn, event-triggered RLVR framework that combines progress induction for feedback-guided regeneration of failed samples, progress credit assignment for learning from verifier-confirmed progress, and structured feedback injection for integrating feedback into the model's reasoning process. Trained on sampled OpenR1-Math, MulFeRL outperforms supervised, self-distillation-based, and RLVR baselines in-domain, while also showing strong out-of-domain generalization.
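The abstract describes the loop only at a high level, so the following is a minimal sketch of one way an event-triggered, multi-turn feedback loop could be organized, not the authors' implementation. Every name here (`Turn`, `feedback_loop`, `progress_turns`, the `generate`, `verify`, and `give_feedback` callables, the prompt template, and the `max_turns` default) is a hypothetical stand-in. The rule used to mark "progress" (a regenerated attempt that the verifier accepts after a rejected one) is likewise an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    response: str   # the model's attempt in this turn
    correct: bool   # verifier outcome for this attempt
    feedback: str   # feedback that conditioned this turn ("" for the first attempt)


def feedback_loop(
    prompt: str,
    generate: Callable[[str], str],            # hypothetical policy: context -> response
    verify: Callable[[str], bool],             # hypothetical verifier: response -> pass/fail
    give_feedback: Callable[[str, str], str],  # hypothetical critic: (prompt, response) -> text
    max_turns: int = 3,                        # assumed turn budget
) -> List[Turn]:
    """Regenerate with verbal feedback, but only while the verifier rejects the attempt."""
    turns: List[Turn] = []
    context = prompt
    feedback = ""
    for _ in range(max_turns):
        response = generate(context)
        correct = verify(response)
        turns.append(Turn(response, correct, feedback))
        if correct:
            break  # event trigger: only a failed sample opens another turn
        feedback = give_feedback(prompt, response)
        # Assumed injection format: append the failed attempt and the feedback to the prompt.
        context = (
            f"{prompt}\n\nPrevious attempt:\n{response}\n\nFeedback:\n{feedback}"
        )
    return turns


def progress_turns(turns: List[Turn]) -> List[int]:
    """Indices of turns where a rejected attempt was followed by an accepted one.

    Illustrative stand-in for 'verifier-confirmed progress'; the paper's actual
    credit assignment is not specified in the abstract.
    """
    return [
        i for i in range(1, len(turns))
        if turns[i].correct and not turns[i - 1].correct
    ]
```

In a training setup, the turns returned by `progress_turns` would be the candidates for a learning signal, while samples that pass on the first attempt never enter the feedback branch.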

Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1 | 90.24 | 239 |
| Mathematical Reasoning | Minerva | Pass@1 | 55.96 | 138 |
| Mathematical Reasoning | Olympiad Bench | Pass@1 Accuracy | 68.13 | 115 |
| Mathematical Reasoning | AMC23 | Pass@1 | 90.3 | 43 |
| Mathematical Reasoning | AIME 24 | Pass@1 | 68.13 | 39 |
| Scientific and General Reasoning | MMLU-Pro | Pass@1 | 68.08 | 21 |
| Scientific and General Reasoning | GPQA Diamond | Pass@1 | 50.2 | 21 |
| Scientific and General Reasoning | Theorem QA | Pass@1 | 55.75 | 18 |
