VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

About

LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
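The combination step described above can be illustrated with a minimal sketch. It assumes each judged item has already been reduced to three scores in [0, 1] named after the signals in the abstract; how those scores are actually extracted from the reasoning trace is not specified here, so the data below is synthetic and purely illustrative. A plain logistic regression fit by gradient descent stands in for the Platt-scaled model, and the function names (`fit_logistic`, `predict`, `auroc`) are hypothetical helpers, not part of any released VERDI code.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Estimated probability that the judge's verdict is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def auroc(scores, labels):
    """Chance that a random correct verdict outscores a random incorrect one.

    Ties count as half a win. A value below 0.5 means the score is
    anti-calibrated: errors tend to receive the higher confidence.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data (an assumption for illustration): label 1 means the judge's
# verdict was correct. Each row holds three hypothetical signal values:
# [step_verdict_alignment, claim_level_margin, evidence_grounding].
rng = random.Random(0)
labels = [1 if i % 10 < 7 else 0 for i in range(200)]
signals = [
    [min(1.0, max(0.0, rng.gauss(0.7 if l else 0.4, 0.2))) for _ in range(3)]
    for l in labels
]

w, b = fit_logistic(signals, labels)
confidence = [predict(w, b, x) for x in signals]
combined_auroc = auroc(confidence, labels)

# Negating the score reverses every pairwise comparison, which is what an
# anti-calibrated signal looks like under this metric.
inverted_auroc = auroc([-c for c in confidence], labels)

print("combined AUROC:", round(combined_auroc, 3))
print("inverted AUROC:", round(inverted_auroc, 3))
```

AUROC depends only on how the confidence scores rank the items, so a monotone recalibration such as Platt scaling changes the probabilities but not the AUROC. The sketch evaluates on its own training data for brevity; a real evaluation would use a held-out split.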

Jasmine Qi, Danylo Dantsev, Muyang Sun • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Scientific Fact Verification | SciFact | -- | 25 |
| Sentence-Level Confidence Prediction | FEVER | AUROC 0.737 | 15 |
| Sentence-Level Confidence Prediction | SciFact | AUROC 0.91 | 12 |
| Verification | SummEval | AUROC 0.755 | 8 |
| Verification | FEVER | AUROC 0.737 | 8 |
| Confidence Estimation | SummEval | AUROC 0.717 | 5 |
| Verification | Qwen traces Cross-model Transfer | AUROC 0.693 | 3 |
| Confidence Estimation | Internal Verification Suite GPT-4.1-mini (test) | Q1 Attribution Present (AUROC) 0.806 | 3 |
