LMUnit: Fine-grained Evaluation with Natural Language Unit Tests

About

As language models become integral to critical workflows, assessing their behavior remains a fundamental challenge -- human evaluation is costly and noisy, while automated metrics provide only coarse, difficult-to-interpret signals. We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings, and natural language rationales. Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows. LMUnit achieves state-of-the-art performance on evaluation benchmarks (FLASK, BigGenBench) and competitive results on RewardBench. These results validate both our proposed paradigm and scoring model, suggesting a promising path forward for language model evaluation and development.

Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, Shikib Mehri• 2024

Related benchmarks

Task	Dataset	Result
Reward Modeling	RewardBench v2 (test)	Average Score82.1	67
Reward Modeling	RewardBench 2	Precise IF Score54.4	41
Pair-wise comparison	RewardBench	Accuracy93.45	29
Reward Model Evaluation	RewardBench 2	Factuality87.2	21
Pairwise Ranking	LFQA	Pairwise Preference Accuracy76.53	13
Direct Assessment	FLASK	Pearson Correlation Coefficient0.7203	12
Direct Assessment	BiGGen-Bench	Pearson Correlation Coefficient67.69	12
Model Performance Evaluation	Table 1 Aggregate excluding Human-Internal	Average Score79.74	12
Classification	InfoBench	Binary Accuracy91.26	12
Classification	Human-Internal	Binary Accuracy94.14	10

Showing 10 of 10 rows

Other info

Follow for update

@wizwand_team Discord