VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

About

LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
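
As a rough illustration of the combination step described above, the sketch below fits a logistic regression over three per-example signals and scores it with AUROC. It is not the authors' implementation. The signal names (sva, clm, egs), their [0, 1] range, the synthetic data, and the plain gradient-descent fit are all assumptions made for this example, and the abstract does not specify how Platt scaling is applied. An AUROC below 0.5 means the score ranks errors above correct verdicts, which is what the abstract calls anti-calibrated.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean log-loss.
    # X: list of feature vectors, y: 1 if the judge verdict was correct, else 0.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def confidence(w, b, x):
    # Estimated probability that the verdict is correct.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def auroc(scores, labels):
    # Probability that a correct verdict outscores an incorrect one (ties count 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def demo(seed=0, n=200):
    # Synthetic (sva, clm, egs) triples: hypothetical data, not from the paper.
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        correct = i % 2
        lo, hi = (0.5, 1.0) if correct else (0.0, 0.6)
        X.append([rng.uniform(lo, hi) for _ in range(3)])
        y.append(correct)
    w, b = fit_logistic(X, y)
    return auroc([confidence(w, b, x) for x in X], y)


if __name__ == "__main__":
    print("AUROC on synthetic data:", round(demo(), 3))
```

In practice the fit and the AUROC would be computed on separate splits. The demo scores its own training data only to keep the sketch short.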

Jasmine Qi, Danylo Dantsev, Muyang Sun • 2026

Related benchmarks

Task	Dataset	Result
Scientific Fact Verification	SciFact	--	25
Sentence-Level Confidence Prediction	FEVER	AUROC 0.737	15
Sentence-Level Confidence Prediction	SciFact	AUROC 0.91	12
Verification	SummEval	AUROC 0.755	8
Verification	FEVER	AUROC 0.737	8
Confidence Estimation	SummEval	AUROC 0.717	5
Verification	Qwen traces Cross-model Transfer	AUROC 0.693	3
Confidence Estimation	Internal Verification Suite GPT-4.1-mini (test)	Q1 Attribution Present (AUROC) 0.806	3

