Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification
About
Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains over the train-selected baseline are consistent on both AUROC and AUPRC; against the full 15-prompt distribution, they remain consistent on AUPRC but soften on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, which identifies prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
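As an illustration of the quantities named in the abstract, the following is a minimal sketch of mean aggregation and cross-prompt variance for a single sample, plus the binary NLL used to score the result. It is not the authors' implementation: the function names, the 0.5 decision threshold used to flag prompt-level disagreement, and the use of population variance are all assumptions made for this example.

```python
import math
from statistics import fmean, pvariance


def aggregate_prompts(unsafe_probs, threshold=0.5):
    """Summarize P(unsafe) for ONE sample across equivalent prompts.

    unsafe_probs: one first-token unsafe probability per prompt.
    threshold: hypothetical decision threshold (an assumption, not from the paper).
    Returns (mean probability, cross-prompt variance, prompts_disagree).
    """
    mean_p = fmean(unsafe_probs)          # training-free mean ensemble score
    var_p = pvariance(unsafe_probs)       # population variance across prompts (assumed)
    votes = {p >= threshold for p in unsafe_probs}
    prompts_disagree = len(votes) > 1     # True if prompts yield different labels
    return mean_p, var_p, prompts_disagree


def binary_nll(probs, labels, eps=1e-12):
    """Average negative log-likelihood of binary labels (1 = unsafe)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Three semantically equivalent prompts scoring the same image-text sample.
mean_p, var_p, disagree = aggregate_prompts([0.2, 0.4, 0.9])
```

In this toy input the three prompts straddle the threshold, so the sample is flagged as a disagreement case and its variance is comparatively high, which is the kind of signal the abstract describes as a fragility diagnostic.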
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Harmful Content Detection | UnsafeBench | AUPRC | 71.7 | 49 |
| Harmful Content Detection | HoliSafe-Bench | AUPRC | 75.6 | 49 |
| Safety Classification | UnsafeBench | AUROC | 80.5 | 49 |
| Safety Classification | HoliSafe-Bench | AUROC | 0.783 | 49 |
| Safety Classification | UnsafeBench | ECE | 0.061 | 21 |
| Safety Classification | HoliSafe-Bench | ECE | 8.4 | 21 |