
MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

About

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely signal failure and provide no insight into why the reasoning went wrong. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework built on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples; (2) two complementary learning signals for within-turn and cross-turn optimization; and (3) structured injection of feedback into the model's reasoning process. Trained on a sampled subset of OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
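The first mechanism, feedback-triggered regeneration, can be sketched in a few lines. This is a minimal illustrative loop, not the paper's implementation: the function names (`generate`, `verify`, `critique`), the stubbed policy behavior, and the turn limit are all hypothetical placeholders standing in for the model, the verifiable-reward checker, and the verbal-feedback source.

```python
# Hypothetical sketch of dynamic multi-turn regeneration guided by feedback.
# All names and behaviors below are illustrative, not from the paper.

MAX_TURNS = 3  # assumed turn budget

def generate(prompt, feedback=None):
    # Stub policy: here it only succeeds after feedback is injected,
    # to make the regeneration path visible in the trace.
    return "correct" if feedback else "wrong"

def verify(answer):
    # Verifiable outcome reward: 1 for a correct final answer, else 0.
    return 1 if answer == "correct" else 0

def critique(prompt, answer):
    # Verbal feedback explaining the failure (stubbed here; in the paper's
    # setting this would come from a feedback model or checker).
    return f"The answer '{answer}' is incorrect; recheck the final step."

def rollout(prompt):
    """Regenerate only on failed turns, injecting feedback into later turns."""
    feedback = None
    trajectory = []
    for turn in range(MAX_TURNS):
        answer = generate(prompt, feedback)
        reward = verify(answer)
        trajectory.append((turn, answer, reward))
        if reward == 1:  # success: no further regeneration is triggered
            break
        feedback = critique(prompt, answer)  # feed verbal signal to next turn
    return trajectory

print(rollout("2+2=?"))
```

Note that successful samples exit after a single turn, so the multi-turn loop adds cost only where the scalar reward alone is uninformative; the per-turn `(turn, answer, reward)` records are what the within-turn and cross-turn learning signals would be computed over.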

Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | Pass@1 | 90.24 | 153 |
| Mathematical Reasoning | Minerva | Pass@1 | 55.96 | 138 |
| Mathematical Reasoning | OlympiadBench | Pass@1 | 68.13 | 115 |
| Mathematical Reasoning | AMC23 | Pass@1 | 90.3 | 43 |
| Mathematical Reasoning | AIME 24 | Pass@1 | 68.13 | 39 |
| Scientific and General Reasoning | MMLU-Pro | Pass@1 | 68.08 | 21 |
| Scientific and General Reasoning | GPQA Diamond | Pass@1 | 50.2 | 21 |
| Scientific and General Reasoning | TheoremQA | Pass@1 | 55.75 | 18 |
