MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
About
Recent advances in large language models (LLMs) have been driven by reinforcement-learning-based post-training, which requires multiple rollouts with rewards. However, obtaining ground-truth labels for reward computation at scale often requires expensive human labeling or time-consuming verification procedures. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning can be constrained. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled rollouts propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B in mathematics, question answering, and code generation demonstrate that MemReward, with only 20% of labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, and surpasses the Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
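The core idea of propagating rewards from labeled to unlabeled rollouts over a similarity graph can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces the trained GNN with simple iterative label propagation over similarity edges, and uses a toy bag-of-words cosine similarity between thinking processes in place of learned embeddings. All function names below are hypothetical.

```python
# Hedged sketch of MemReward-style reward propagation (not the paper's code).
# Nodes: rollouts; edge weights: toy cosine similarity over bag-of-words of
# each rollout's thinking process. Labeled rollouts seed rewards, which are
# spread to unlabeled rollouts by iterative neighbor averaging -- a simple
# label-propagation stand-in for the paper's trained GNN.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def propagate_rewards(rollouts, labeled, iters=20):
    """rollouts: list of thinking-process strings; labeled: {index: reward}."""
    n = len(rollouts)
    w = [[cosine(rollouts[i], rollouts[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    r = [labeled.get(i, 0.5) for i in range(n)]  # unlabeled start at 0.5
    for _ in range(iters):
        for i in range(n):
            if i in labeled:
                continue  # labeled rewards stay clamped
            z = sum(w[i])
            if z > 0:  # weighted average of neighbor rewards
                r[i] = sum(w[i][j] * r[j] for j in range(n)) / z
    return r

rollouts = [
    "factor the quadratic then solve for x",   # labeled correct
    "factor quadratic solve for x carefully",  # unlabeled
    "guess random numbers until one works",    # labeled incorrect
]
rewards = propagate_rewards(rollouts, labeled={0: 1.0, 2: 0.0})
print(round(rewards[1], 2))  # → 1.0 (pulled toward its only similar neighbor)
```

The unlabeled rollout inherits a high reward because its only non-zero similarity edge connects it to the correct labeled rollout; the paper's heterogeneous graph additionally links rollouts through shared query and answer nodes, which this sketch omits.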
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Question Answering | ARC Challenge | Accuracy | 80.44 | 906 |
| Question Answering | GPQA | Accuracy | 30 | 33 |
| Question Answering | OBQA | Accuracy | 81.78 | 14 |
| Question Answering | MMLU | Accuracy | 0.72 | 8 |
| Mathematical Reasoning | MATH | Exact Match Accuracy | 61.11 | 6 |
| Reward Prediction | NuminaMath (out-of-domain) | Accuracy | 42.22 | 6 |
| Reward Prediction | SIQA (out-of-domain) | Accuracy | 76.89 | 6 |
| Reward Prediction | Out-of-Domain Task Suite (NuminaMath, SIQA, PIQA) | Average Score | 66.96 | 6 |
| Mathematical Reasoning | GSM8K | Exact Match Accuracy | 92.89 | 6 |
| Mathematical Reasoning | GSM-sym | Exact Match | 86.44 | 6 |