Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

About

Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the distribution of errors, is largely discarded. We identify this as a missed opportunity: empirical analysis reveals that error diversity within a group is a strong predictor of training success, with problems eliciting diverse wrong answers benefiting substantially more from RLVR than those producing homogeneous failures. Motivated by this observation, we propose Error Diversity Advantage Shaping (EDAS), a lightweight, algorithm-agnostic technique that modulates the advantage signal for incorrect rollouts based on intra-group error diversity. EDAS amplifies penalties for dominant, repeated errors and attenuates penalties for rare, exploratory ones, thereby encouraging the model to maintain diverse reasoning paths and discouraging error perseveration. Crucially, EDAS operates as a simple post-hoc adjustment that can be seamlessly integrated into any RLVR algorithm. We validate EDAS on top of several mainstream RLVR methods across a series of models and seven challenging math benchmarks, demonstrating consistent improvements. Notably, EDAS yields an average improvement of 6.29 points over DAPO on Qwen3-8B across seven benchmarks, confirming that exploiting the latent information in group rollouts is a broadly effective strategy for strengthening RLVR.
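
The abstract does not give the EDAS formula, so the following is only a minimal sketch of the idea it describes: start from group-normalized advantages, then rescale the penalty on each incorrect rollout by how common its wrong answer is within the group. The function name `shape_advantages`, the strength parameter `alpha`, and the specific weighting rule (each wrong answer's share of the errors compared against a uniform share) are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from statistics import mean, pstdev


def shape_advantages(answers, rewards, alpha=0.5, eps=1e-6):
    """Group-normalized advantages with wrong-answer penalties rescaled.

    Hypothetical weighting rule (an assumption, not taken from the paper):
    weight = 1 + alpha * (share_of_errors - 1 / num_distinct_errors).
    """
    # GRPO-style baseline: normalize binary rewards within the group.
    mu, sigma = mean(rewards), pstdev(rewards)
    base = [(r - mu) / (sigma + eps) for r in rewards]

    # Collect the final answers of the incorrect rollouts only.
    wrong = [a for a, r in zip(answers, rewards) if r == 0]
    if not wrong:
        return base  # nothing to reshape when every rollout is correct

    counts = Counter(wrong)
    uniform_share = 1.0 / len(counts)

    shaped = []
    for a, r, adv in zip(answers, rewards, base):
        if r == 0:
            share = counts[a] / len(wrong)
            # Dominant errors get weight > 1 (larger penalty),
            # rare errors get weight < 1 (smaller penalty).
            adv *= 1.0 + alpha * (share - uniform_share)
        shaped.append(adv)
    return shaped


# Toy group: two correct rollouts ("42"), one error repeated three times, one rare error.
group_answers = ["42", "17", "17", "17", "9", "42"]
group_rewards = [1, 0, 0, 0, 0, 1]
print(shape_advantages(group_answers, group_rewards))
```

In this toy group the repeated wrong answer `"17"` ends up with a more negative advantage than the rare wrong answer `"9"`, while the advantages of the correct rollouts are left unchanged. Because the adjustment is applied after the baseline advantages are computed, the same post-hoc step could in principle sit on top of any group-based advantage estimator.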

Wenpu Liu, Yuqi Xu, Weichu Xie, Yongfu Zhu, Shuai Dong, Ziyue Wang, Wenqi Shao, Xiaoying Zhang, Tong Yang, Nan Duan, Jiaqi Wang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	AIME 2024	Accuracy	48.85	479
Mathematical Reasoning	AIME 2026	AIME 2026 Accuracy	48.65	55
Mathematical Reasoning	HMMT	Accuracy	23.12	39
Mathematical Reasoning	AMC 2024	Accuracy	59.38	23
Mathematical Reasoning	AMC, AIME, HMMT, and OlympiadBench Aggregate	Accuracy	54.11	15
Code Generation	LiveCodeBench	Pass@k	32.07	2
Code Generation	CodeForces	Pass@k	45.97	2
Code Generation	HumanEval+	Pass@k	80.49	2
Code Generation	MBPP+	Pass@k	64.02	2

