Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

About

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods rely on either task-level rubrics or question-aware checklists. However, they still (1) struggle to assess whether a response is genuinely grounded in the provided contexts and (2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted, context-bound scoring points. Two complementary metrics, Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and the contradiction, respectively, between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

Guoxin Yu, Chulun Zhou, Lemao Liu, Qi Wang, Mo Yu, Jialong Tang, Baosong Yang, Xiang Ao, Wai Lam, Yue Yu • 2026
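
The abstract names the two metrics but does not give their formulas, so the following is only a minimal sketch of how weighted scoring points could be aggregated. Everything here is an assumption made for illustration: the `ScoringPoint` and `Verdict` types, the normalization by total weight, and the hand-supplied verdicts (in practice a human or model judge would produce them). It should not be read as the paper's definition of WPA or PCP.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Hypothetical per-point judgment of a response against one scoring point."""
    ALIGNED = "aligned"    # response states the point
    CONFLICT = "conflict"  # response contradicts the point
    MISSING = "missing"    # response does not address the point


@dataclass(frozen=True)
class ScoringPoint:
    text: str      # one factor extracted from the reference answer
    weight: float  # importance of the factor (assumed positive)


def _weighted_share(points, verdicts, target):
    """Fraction of the total weight whose verdict equals `target`."""
    if len(points) != len(verdicts):
        raise ValueError("need exactly one verdict per scoring point")
    total = sum(p.weight for p in points)
    if total <= 0:
        return 0.0
    matched = sum(p.weight for p, v in zip(points, verdicts) if v is target)
    return matched / total


def weighted_alignment(points, verdicts):
    """Assumed WPA-style score: weight share of points the response matches."""
    return _weighted_share(points, verdicts, Verdict.ALIGNED)


def conflict_penalty(points, verdicts):
    """Assumed PCP-style score: weight share of points the response contradicts."""
    return _weighted_share(points, verdicts, Verdict.CONFLICT)


# Illustrative points and verdicts, invented for this example.
points = [
    ScoringPoint("names the main cause", 3.0),
    ScoringPoint("gives the supporting figure", 2.0),
    ScoringPoint("mentions the caveat", 1.0),
]
verdicts = [Verdict.ALIGNED, Verdict.CONFLICT, Verdict.MISSING]

alignment = weighted_alignment(points, verdicts)  # 3 / 6
penalty = conflict_penalty(points, verdicts)      # 2 / 6
```

Keeping alignment and conflict as separate numbers, rather than folding them into one score, mirrors the abstract's description of the two metrics as complementary: a response can cover the heavily weighted points and still contradict a lighter one.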

Related benchmarks

| Task | Dataset | Result (Spearman correlation) | Rank |
| --- | --- | --- | --- |
| Long-form Answer Generation | StoryQA | 0.9203 | 8 |
| Long-form Answer Generation | ReviewSumm | 0.4383 | 8 |
| Long-form Answer Generation | MeetingSum | 0.5896 | 8 |
| Long-form Answer Generation | FinancialQA | 0.545 | 8 |
| Long-form Answer Generation | PaperAssist | 0.4687 | 8 |
| Long-form Answer Generation | ConversMem | 0.8806 | 8 |
| Long-form Answer Generation | LongStory | 0.8442 | 8 |
| Long-form Answer Generation | NewsSumm | 0.4895 | 8 |
| Long-form Answer Generation | MultiDocQA | 0.392 | 8 |
| Long-form Answer Generation | ContractQA | 0.5268 | 8 |
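
The results above are reported as Spearman correlations, which compare the ranking induced by an automatic metric with the ranking induced by human annotations. As a rough illustration of what that figure measures, here is a small dependency-free sketch that computes it as the Pearson correlation of average ranks. The score lists are invented for the example and are not taken from the paper, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: metric scores and human ratings for five responses.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 3, 1, 4]

rho = spearman(metric_scores, human_ratings)
print(round(rho, 4))  # 0.9
```

Because only the ranks matter, the metric and the human ratings can live on different scales; the two rankings here differ by a single swap, which is what pulls the value below 1.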