Confidence Calibration in Vision-Language-Action Models

About

Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
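The calibration metrics that appear in the benchmark results on this page, expected calibration error (ECE) and the Brier score, can be illustrated with a short sketch. This is a generic, minimal version under stated assumptions, not the authors' evaluation code: it assumes one scalar confidence in [0, 1] per episode and a binary success outcome, and uses equal-width bins. The function names, the bin count, and the example numbers are hypothetical.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width binned ECE.

    Weighted average, over non-empty bins, of the absolute gap between
    the empirical success rate and the mean confidence in that bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, outcome))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(success_rate - mean_conf)
    return ece


def brier_score(confidences, outcomes):
    """Mean squared error between confidence and the binary outcome."""
    n = len(confidences)
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / n


# Hypothetical data: an overconfident policy that succeeds on half of
# the episodes it is 95% sure about, and fails the low-confidence ones.
example_confidences = [0.95, 0.95, 0.95, 0.95, 0.25, 0.25]
example_outcomes = [1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print("ECE:", expected_calibration_error(example_confidences, example_outcomes))
    print("Brier:", brier_score(example_confidences, example_outcomes))
```

Both functions return values on a 0 to 1 scale, and lower is better for each. A perfectly calibrated model has an ECE of zero, while the Brier score also rewards confidences that separate successes from failures.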

Thomas P Zollo, Richard Zemel • 2025

Related benchmarks

Task	Dataset	Result
Confidence Estimation	VLCB Pooled Aggregate (test)	ECE 12.77	48
Failure Detection	LIBERO Unseen	Brier Score 0.218	37
Failure Detection	LIBERO seen	Brier Score 0.212	37
Large Vision-Language Model Evaluation	Unweighted Average	ECE 31.9	29
Vision-Language Question Answering	Pooled Shared (GQA, POPE, LLaVA-Wild, MMMU Pro, GMAI-MMBench, MME-Finance) (test)	Expected Calibration Error (ECE) 12.8	22
Failure Detection	WidowX seen	Brier Score 0.255	11
Failure Detection	WidowX (unseen)	Brier Score 0.257	11
Failure Detection	Franka seen	Brier Score 0.29	11
Failure Detection	Franka unseen	Brier Score 0.294	11
Robot Manipulation	LIBERO-Plus	Success Rate 33.96	11

Showing 10 of 15 rows
