AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

About

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, including gains of +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines. Analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen • 2026

Related benchmarks

Task	Dataset	Metric	Result
Instruction Following	IFEval	Accuracy (IFEval)	81	101
Creative Writing	Creative Writing v3	Overall Rubric Score	40.1	44
Creative Writing	WritingBench	Score	57.9	42
Medical Reasoning	HealthBench	Accuracy	34	36
Instruction Following	IFBench	Accuracy	36	18
Instruction Following	InfoBench	Accuracy	85.2	8
Scientific Reasoning	GPQA Diamond	Accuracy	40.4	6

