Confidence Calibration in Vision-Language-Action Models
About
Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
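To make the recalibration idea concrete, here is a minimal sketch of standard Platt scaling in NumPy. It is not the authors' implementation. It assumes each prediction comes with a raw confidence in (0, 1) and a binary success label, and it fits a slope `a` and offset `b` on the logit of that confidence by minimizing the negative log-likelihood. The names `fit_platt`, `apply_platt`, and `fit_platt_per_dimension` are hypothetical, and reading "action-wise" as "one `(a, b)` pair per action dimension" is an assumption, not something the abstract states.

```python
import numpy as np


def _logit(p, eps=1e-6):
    # Clip so that confidences of exactly 0 or 1 do not produce infinities.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_platt(conf, success, steps=500):
    """Fit (a, b) so that sigmoid(a * logit(conf) + b) matches success rates."""
    z = _logit(conf)
    y = np.asarray(success, dtype=float)
    a, b = 1.0, 0.0  # start from the identity map (no recalibration)
    # The mean logistic loss has a gradient Lipschitz constant of at most
    # 0.25 * (mean(z^2) + 1), so this step size gives stable descent.
    lr = 4.0 / (np.mean(z ** 2) + 1.0)
    for _ in range(steps):
        residual = _sigmoid(a * z + b) - y  # d(NLL)/d(logit) per sample
        a -= lr * np.mean(residual * z)
        b -= lr * np.mean(residual)
    return a, b


def apply_platt(conf, a, b):
    """Map raw confidences to recalibrated confidences."""
    return _sigmoid(a * _logit(conf) + b)


def fit_platt_per_dimension(conf_matrix, success):
    """Assumed 'action-wise' variant: one (a, b) pair per action dimension.

    conf_matrix has shape (num_samples, num_action_dims).
    """
    conf_matrix = np.asarray(conf_matrix, dtype=float)
    return [fit_platt(conf_matrix[:, d], success)
            for d in range(conf_matrix.shape[1])]
```

For example, a model that always reports 0.9 confidence but succeeds only half the time would be mapped to roughly 0.5 after `a, b = fit_platt(conf, success)` followed by `apply_platt(conf, a, b)`. The parameters should be fit on a held-out calibration set rather than on the data used for evaluation.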
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Confidence Estimation | VLCB Pooled Aggregate (test) | ECE | 12.77 | 48 |
| Failure Detection | LIBERO Unseen | Brier Score | 0.218 | 37 |
| Failure Detection | LIBERO seen | Brier Score | 0.212 | 37 |
| Large Vision-Language Model Evaluation | Unweighted Average | ECE | 31.9 | 29 |
| Vision-Language Question Answering | Pooled Shared (GQA, POPE, LLaVA-Wild, MMMU Pro, GMAI-MMBench, MME-Finance) (test) | Expected Calibration Error (ECE) | 12.8 | 22 |
| Failure Detection | WidowX seen | Brier Score | 0.255 | 11 |
| Failure Detection | WidowX (unseen) | Brier Score | 0.257 | 11 |
| Failure Detection | Franka seen | Brier Score | 0.29 | 11 |
| Failure Detection | Franka unseen | Brier Score | 0.294 | 11 |
| Confidence Estimation | LIBERO Online Execution spatial object goal | ECE | 0.0276 | 8 |
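The two metrics in the table, expected calibration error (ECE) and the Brier score, can be computed as in the sketch below. It uses the common equal-width binning definition of ECE; the bin count of 10 and the function names are illustrative assumptions, and the benchmarks listed above may use different binning or scaling.

```python
import numpy as np


def expected_calibration_error(conf, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to a right-inclusive bin (lo, hi];
    # a confidence of exactly 0 falls into the first bin.
    bin_idx = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return float(ece)


def brier_score(conf, correct):
    """Mean squared difference between confidence and the binary outcome."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))
```

Both functions return values in [0, 1], with lower being better. Some ECE entries in the table (for example 12.77 and 31.9) appear to be reported on a percentage scale, so a value from this sketch would need to be multiplied by 100 to be compared with them.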