ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

About

Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manually constructed question sets, while fixed task-level rubrics may fail to capture the evaluation requirements of individual questions. We propose ARES (Automated Rubric synthEsis for Scalable RL), a framework for automatically constructing rubric-based RL data at scale. Starting from raw pretraining documents, ARES converts source knowledge into self-contained question-answer pairs and co-generates question-specific weighted rubrics, enabling instance-level reward supervision for open-ended responses. To improve diversity and quality, ARES conditions generation on domain labels and persona information, and applies validation filters for question self-containment, answer faithfulness, and rubric validity. Using ARES, we construct 100K rubric-annotated instances across ten domains. Experiments on seven benchmarks show that rubric-based RL trained with ARES outperforms continual pretraining, supervised fine-tuning, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks such as healthcare and instruction following.
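To make the idea of a question-specific weighted rubric concrete, the following is a minimal sketch of how per-criterion judgments might be turned into a scalar reward. The abstract does not specify the aggregation ARES uses, so the normalized weighted sum here is an assumption, and the names `Criterion` and `rubric_reward` are hypothetical. In practice the boolean judgments would presumably come from a grader model; here they are passed in directly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item for a single question (hypothetical structure)."""
    description: str
    weight: float


def rubric_reward(criteria: Sequence[Criterion], satisfied: Sequence[bool]) -> float:
    """Return the weight of satisfied criteria divided by the total weight.

    Assumption: a normalized weighted sum. The source does not state the
    aggregation rule, so treat this as one plausible choice.
    """
    if len(criteria) != len(satisfied):
        raise ValueError("need exactly one judgment per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c, ok in zip(criteria, satisfied) if ok)
    return earned / total


# Illustrative rubric for one question; the criteria text is made up.
example_rubric = [
    Criterion("States the correct first-line treatment", 3.0),
    Criterion("Mentions relevant contraindications", 2.0),
    Criterion("Uses language suitable for a patient", 1.0),
]
example_reward = rubric_reward(example_rubric, [True, False, True])
```

Under this assumption, a response meeting the first and third criteria earns 4 of 6 weight units. A binary reward would collapse the same response to either 0 or 1, which is the contrast the abstract draws between rubric-based and binary-reward RL.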

Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu • 2026

Related benchmarks

Task                           Dataset                Result                    Rank
Math Reasoning                 GSM8K                  Accuracy (GSM8K): 86.96   131
Knowledge Reasoning            MMLU-Pro               --                        120
Writing                        WritingBench           Score: 38.24              74
Language Understanding         MMLU-Pro               MMLU-Pro Accuracy: 50.56  60
Open-ended writing             WritingBench           Score: 38.24              20
Instruction Following          IFEval                 Score (%): 54.88          18
Code Generation                MBPP+                  AVG Score: 63.16          17
Aggregate General Performance  ARES Evaluation Suite  Average Score: 52.69      5
Code Generation                HumanEval+             Score: 34.76              5
Healthcare QA                  HealthBench            Score: 41.45              5
Showing 10 of 11 rows
