Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

About

LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained evaluation rubrics without any human annotation. Our training-free method generates rubrics at dataset-specific and instance-specific granularities, achieving performance competitive with existing methods across four benchmarks. We further present a method that iteratively fine-tunes a rubric generator model via meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation. Notably, a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation, showing the effectiveness of our fine-tuning strategy.

Zijie Wang, Eduardo Blanco • 2026

Related benchmarks

Task	Dataset	Result
Pairwise Evaluation	BIGGEN	Human Agreement 76.96	41
Pairwise Evaluation	AlpacaEval	Human Agreement 72.4	37
General Utility Evaluation	MT_Bench	Agreement Rate 81.62	33
Pointwise Evaluation	BIGGEN	Spearman Corr 0.51	32
Pointwise Evaluation	HelpSteer2	Spearman Correlation 0.464	28
Pairwise LLM Judging	MT-Bench	--	16
Pairwise Evaluation	MT-Bench	Human Agreement Rate 83.69	9
Automated Metric Evaluation	Scientific QA LitQA2 (n=58)	Coverage 1	5
Automated Metric Evaluation	Account grouping financial (n=112)	Coverage 40	5
Automated Metric Evaluation	Inherent risk assessment (IRF) (n=112)	Coverage 83	5
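
The table reports two kinds of metric: agreement with human preferences for pairwise evaluation, and Spearman correlation with human scores for pointwise evaluation. As a rough illustration of what these numbers measure, here is a minimal sketch using the standard definitions. The function names and toy inputs are hypothetical, and this is not the paper's evaluation code; the exact protocol used on each benchmark (tie handling, averaging across instances) is not stated here and may differ.

```python
from statistics import mean


def human_agreement(judge_prefs, human_prefs):
    """Percentage of pairs where the judge picks the same response as the human."""
    if not judge_prefs or len(judge_prefs) != len(human_prefs):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return 100.0 * matches / len(judge_prefs)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(judge_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant (undefined correlation).
    """
    rx, ry = average_ranks(judge_scores), average_ranks(human_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data only: the judge matches the human on 3 of 4 pairs.
print(human_agreement(["A", "B", "A", "B"], ["A", "B", "B", "B"]))
# Toy pointwise scores from a judge and from human annotators.
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```

Under these definitions, agreement is a percentage over response pairs, while Spearman correlation depends only on how the judge orders the responses, not on the absolute scores it assigns.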

