
SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

About

The evolution of Large Language Models (LLMs) is shifting the focus from single, verifiable tasks toward complex, open-ended real-world scenarios, posing significant challenges for the post-training phase. In these settings, reward systems have grown in scale and complexity, transitioning toward multi-objective formulations that span a comprehensive spectrum of model capabilities and application contexts. However, traditional methods typically rely on fixed reward weights, ignoring non-stationary learning dynamics and struggling with data heterogeneity across dimensions. To address these issues, we propose SPARD, a framework that establishes an automated, self-paced curriculum: by perceiving learning progress, it dynamically adjusts multi-objective reward weights and data importance, thereby synchronizing learning intent with data utility for optimal performance. Extensive experiments across multiple benchmarks demonstrate that SPARD significantly enhances model capabilities across all domains.
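The paper's exact algorithm is not given on this page, but the core idea it describes, re-weighting multiple reward objectives according to perceived learning progress, can be sketched roughly as follows. All function and parameter names here (update_weights, temperature, the slope-based progress estimate, the softmax scheme) are illustrative assumptions, not SPARD's actual method:

```python
import numpy as np

def update_weights(reward_history, temperature=1.0):
    """Hypothetical self-paced multi-objective weighting sketch.

    reward_history: dict mapping objective name -> list of recent mean rewards.
    Objectives whose rewards have improved least (low learning progress)
    receive higher weight, so training attention shifts to stalled dimensions.
    """
    progress = {}
    for name, hist in reward_history.items():
        hist = np.asarray(hist, dtype=float)
        t = np.arange(len(hist))
        # Learning progress estimated as the slope of a least-squares line
        # fit over the recent reward trajectory.
        slope = np.polyfit(t, hist, 1)[0] if len(hist) > 1 else 0.0
        progress[name] = slope
    names = list(progress)
    # Softmax over negative progress: slower-improving objectives up-weighted.
    logits = np.array([-progress[n] / temperature for n in names])
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    w /= w.sum()
    return dict(zip(names, w))

def combined_reward(per_objective_rewards, weights):
    """Scalarize per-objective rewards with the current weights."""
    return sum(weights[n] * r for n, r in per_objective_rewards.items())
```

For example, an objective whose recent rewards are flat (no progress) would be up-weighted relative to one that is improving steadily; the same progress signal could analogously drive per-sample data importance, the other lever the abstract mentions.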

Xuyang Zhi, Peilun Zhou, Chengqiang Lu, Hang Lv, Yiwei Liang, Rongyang Zhang, Yan Gao, Yi Wu, Yao Hu, Hongchao Gu, Defu Lian, Hao Wang, Enhong Chen • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Instruction Following | IFEval | IFEval Accuracy | 88.17 | 625
Scientific Reasoning | GPQA Diamond | Score | 49.49 | 68
Dialogue | MT-Bench | MT-Bench Score | 8.138 | 29
Creative Writing | CreativeWriting v3 | LLM Judge | 73.95 | 26
Subjective Evaluation | WildBench | Score | 0.5529 | 19
Creative Writing | Arena Hard | Win Rate | 45.9 | 18
General Performance Evaluation | Aggregate (IFEval, GPQA, LCB, Arena-Hard, CW, MT-Bench, WildBench) | Average Score | 63.69 | 14
