Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

About

Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited to safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman's ρ = 0.667 vs. 0.676) while consistently outperforming DeepEval. Adjusting the temperature parameter requires no additional LLM calls.
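The aggregation idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the verdict labels, their numeric weights, or the formula that maps T to the power-mean exponent, so everything named below (`VERDICT_SCORES`, `temperature_to_exponent`, the exponent range from -8 to 1, and the `eps` floor) is a hypothetical choice made for illustration, not the paper's definition. The sketch only shows the general mechanism: verdicts are collected once, and the same verdicts are re-aggregated under different temperatures without calling the judge model again.

```python
import math
from typing import Sequence

# Hypothetical five-level verdict scale (labels and weights are assumptions).
VERDICT_SCORES = {
    "fully": 1.0,
    "mostly": 0.75,
    "partially": 0.5,
    "minor": 0.25,
    "none": 0.0,
}


def temperature_to_exponent(t: float, p_min: float = -8.0, p_max: float = 1.0) -> float:
    """Map T in [0.1, 1.0] linearly onto a power-mean exponent.

    Assumed mapping: T = 0.1 gives p_min (close to the minimum, strict),
    T = 1.0 gives p_max (arithmetic mean, lenient).
    """
    if not 0.1 <= t <= 1.0:
        raise ValueError("temperature must lie in [0.1, 1.0]")
    frac = (t - 0.1) / 0.9
    return p_min + frac * (p_max - p_min)


def power_mean(scores: Sequence[float], p: float, eps: float = 1e-6) -> float:
    """Generalized power mean M_p = (mean(x_i ** p)) ** (1 / p)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Floor scores at eps so that zero verdicts do not divide by zero when p < 0.
    clipped = [max(s, eps) for s in scores]
    if abs(p) < 1e-12:
        # The limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(s) for s in clipped) / len(clipped))
    return (sum(s ** p for s in clipped) / len(clipped)) ** (1.0 / p)


def aggregate(verdicts: Sequence[str], temperature: float) -> float:
    """Aggregate already-collected verdict labels at a given temperature."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    return power_mean(scores, temperature_to_exponent(temperature))


# Verdicts are produced once (for example by a judge model), then reused.
verdicts = ["fully", "mostly", "minor"]
strict_score = aggregate(verdicts, temperature=0.1)
lenient_score = aggregate(verdicts, temperature=1.0)
```

Because the power mean is non-decreasing in its exponent, the low-temperature score in this sketch is pulled toward the weakest verdict, while the high-temperature score equals the plain average of the verdict weights.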

Aleksandr Meshkov • 2026

Related benchmarks

Task | Dataset | Result | Rank
Relevancy | SummEval Rel | Spearman's rho: 0.48 | 10
Faithfulness | SummEval | Spearman's rho: 0.667 | 10
Dialogue | USR (N = 198) | Spearman's rho: 0.173 | 7
Dialogue Evaluation | USR | Spearman's rho: 0.173 | 3
Dialogue Faithfulness Evaluation | USR | Kendall's tau: 0.143 | 3
Faithfulness Evaluation | SummEval | Kendall's tau: 0.527 | 3