Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

About

Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] that controls evaluation rigor. Low temperatures yield pessimistic scores suited to safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman's ρ = 0.667 vs. 0.676) while consistently outperforming DeepEval. Adjusting the temperature parameter requires no additional LLM calls.
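
The aggregation step described above can be illustrated with a minimal sketch. The abstract does not give the mapping from T to the power-mean exponent or the numeric values of the five verdict levels, so everything specific here is an assumption made for illustration: the verdict scores 0.0, 0.25, 0.5, 0.75, 1.0, the function names, and the linear map from T to an exponent between the hypothetical bounds p_low = -8 and p_high = 4. Only the generalized power mean itself, M_p(x) = ((1/n) Σ x_i^p)^(1/p), is standard.

```python
import math


def power_mean(scores, p):
    """Generalized power mean of scores in [0, 1] with exponent p."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if abs(p) < 1e-9:
        # The limit p -> 0 is the geometric mean.
        if min(scores) == 0:
            return 0.0
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    if p < 0 and min(scores) == 0:
        # A zero score drives the mean to 0 for negative exponents.
        return 0.0
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


def temperature_to_exponent(t, p_low=-8.0, p_high=4.0):
    """ASSUMED linear map from T in [0.1, 1.0] to an exponent.

    The bounds p_low and p_high are hypothetical, not taken from the paper.
    """
    if not 0.1 <= t <= 1.0:
        raise ValueError("temperature must lie in [0.1, 1.0]")
    return p_low + (t - 0.1) / 0.9 * (p_high - p_low)


def aggregate(scores, t):
    """Aggregate precomputed verdict scores at temperature t."""
    return power_mean(scores, temperature_to_exponent(t))


# Hypothetical verdict scores on an assumed five-level scale
# (0.0, 0.25, 0.5, 0.75, 1.0), obtained once from a judge model.
verdicts = [1.0, 0.75, 0.25, 1.0]

# Re-aggregating at different temperatures reuses the same verdicts,
# so no further judge calls are needed.
strict = aggregate(verdicts, 0.1)
lenient = aggregate(verdicts, 1.0)
```

Because the power mean is non-decreasing in its exponent, `strict` is pulled toward the lowest verdict and `lenient` toward the highest under this assumed mapping, which mirrors the pessimistic and lenient behavior the abstract attributes to low and high temperatures.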

Aleksandr Meshkov • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Relevancy	SummEval Rel	Spearman's rho	0.48	10
Faithfulness	SummEval	Spearman's rho	0.667	10
Dialogue	USR (N = 198)	Spearman's rho	0.173	7
Dialogue Evaluation	USR	Spearman's rho	0.173	3
Dialogue Faithfulness Evaluation	USR	Kendall's tau	0.143	3
Faithfulness Evaluation	SummEval	Kendall's tau (τ)	0.527	3
