RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

About

Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By combining principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (~110k), multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. Our code is available at https://github.com/teqkilla/RubricHub.
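
The abstract names rubric-based rejection sampling (RuFT) but does not spell out how a rubric is turned into a score. The following is a minimal sketch of one plausible reading, not the authors' implementation: each rubric criterion carries a weight, a judge decides whether a response satisfies it, and the best-scoring candidate is kept only if it clears a threshold. The names `Criterion`, `rubric_score`, `rejection_sample`, and `keyword_judge`, the weighted-fraction scoring rule, and the threshold value are all assumptions made for illustration. The keyword judge is a toy stand-in for whatever grader model the real pipeline uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Criterion:
    description: str  # what a good response should contain (hypothetical field)
    weight: float     # relative importance of the criterion (hypothetical field)


Judge = Callable[[str, Criterion], bool]


def rubric_score(response: str, rubric: List[Criterion], judge: Judge) -> float:
    """Weighted fraction of rubric criteria that the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric if judge(response, c))
    return earned / total


def rejection_sample(candidates: List[str], rubric: List[Criterion],
                     judge: Judge, threshold: float) -> Optional[str]:
    """Return the highest-scoring candidate if it meets the threshold, else None."""
    if not candidates:
        return None
    scored = [(rubric_score(r, rubric, judge), r) for r in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


def keyword_judge(response: str, criterion: Criterion) -> bool:
    # Toy grader: a criterion counts as met if its description appears in the response.
    return criterion.description.lower() in response.lower()


# Made-up rubric and candidate responses, for illustration only.
rubric = [
    Criterion("dosage", 2.0),
    Criterion("side effects", 1.0),
    Criterion("see a doctor", 1.0),
]
candidates = [
    "Take the dosage as labeled.",
    "Check the dosage, watch for side effects, and see a doctor if symptoms persist.",
]
kept = rejection_sample(candidates, rubric, keyword_judge, threshold=0.75)
```

Here `kept` is the second candidate, since it satisfies every criterion while the first earns only half of the total weight. Under the same assumptions, the value returned by `rubric_score` could also serve as a scalar reward in a reinforcement learning stage such as RuRL, though the abstract does not state how that reward is actually computed.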

Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | Accuracy (IFEval) | 79.8 | 89 |
| Pairwise Evaluation | BIGGEN | Human Agreement | 72.22 | 41 |
| Pairwise Evaluation | AlpacaEval | Human Agreement | 64.64 | 37 |
| Medical Reasoning | HealthBench | Accuracy | 33 | 36 |
| General Utility Evaluation | MT_Bench | Agreement Rate | 81.72 | 33 |
| Creative Writing | Creative Writing v3 | Overall Rubric Score | 39 | 32 |
| Pointwise Evaluation | BIGGEN | Spearman Correlation | 0.332 | 32 |
| Pointwise Evaluation | HelpSteer2 | Spearman Correlation | 0.286 | 28 |
| Creative Writing | WritingBench | Score | 56.9 | 18 |
| Instruction Following | IFBench | Accuracy | 33.5 | 18 |
Showing 10 of 12 rows

