RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

About

Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (~110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. Our code is available at https://github.com/teqkilla/RubricHub.
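To make the idea of rubric-based rejection sampling concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how criteria are weighted, graded, or thresholded, so the `Criterion` class, the weighted-fraction score, the keyword-based checkers (stand-ins for an LLM grader), and the `threshold` value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float                 # assumed positive importance weight
    check: Callable[[str], bool]  # stand-in for an LLM or rule-based grader


def rubric_score(response: str, rubric: Sequence[Criterion]) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


def rejection_sample(
    candidates: Sequence[str],
    rubric: Sequence[Criterion],
    threshold: float = 0.6,  # assumed cutoff; the abstract gives none
) -> Optional[str]:
    """Return the best-scoring candidate if it clears the threshold, else None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda r: rubric_score(r, rubric))
    return best if rubric_score(best, rubric) >= threshold else None


# Toy rubric with keyword checkers (illustrative only).
rubric = [
    Criterion("Recommends seeing a doctor", 2.0, lambda r: "doctor" in r.lower()),
    Criterion("Mentions dosage", 1.0, lambda r: "dose" in r.lower()),
    Criterion("Stays under 50 words", 1.0, lambda r: len(r.split()) < 50),
]
candidates = ["Take rest.", "See a doctor and check the dose."]
kept = rejection_sample(candidates, rubric)
```

In this toy setup, `kept` is the second candidate, which satisfies every criterion, while a pool containing only "Take rest." would be rejected outright. The same scalar score could in principle serve as a reward signal in a reinforcement learning stage, though the abstract does not describe how RuRL computes its reward.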

Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen • 2026

Related benchmarks

Task	Dataset	Metric	Result
Instruction Following	IFEval	Accuracy (IFEval)	79.8	101
Creative Writing	Creative Writing v3	Overall Rubric Score	39	44
Creative Writing	WritingBench	Score	56.9	42
Pairwise Evaluation	BIGGEN	Human Agreement	72.22	41
Pairwise Evaluation	AlpacaEval	Human Agreement	64.64	37
Medical Reasoning	HealthBench	Accuracy	33	36
General Utility Evaluation	MT_Bench	Agreement Rate	81.72	33
Pointwise Evaluation	BIGGEN	Spearman Correlation	0.332	32
Pointwise Evaluation	HelpSteer2	Spearman Correlation	0.286	28
Instruction Following	IFBench	Accuracy	33.5	18

Showing 10 of 12 rows
