Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

About

Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
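The quantities named in the abstract (the mean ensemble over prompts, cross-prompt variance, NLL, and ECE) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy scores, and the equal-width 10-bin ECE computed on the unsafe probability are all assumptions made for illustration; the paper's exact ECE definition may differ.

```python
import math
from statistics import fmean, pvariance


def mean_ensemble(probs_per_prompt):
    """Average one sample's unsafe probabilities across equivalent prompts."""
    return fmean(probs_per_prompt)


def cross_prompt_variance(probs_per_prompt):
    """Population variance of one sample's unsafe probability across prompts."""
    return pvariance(probs_per_prompt)


def binary_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood for binary labels (1 = unsafe)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binned ECE on the unsafe probability (assumed variant)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if b:
            mean_prob = fmean([p for p, _ in b])
            unsafe_rate = fmean([y for _, y in b])
            ece += len(b) / n * abs(mean_prob - unsafe_rate)
    return ece


# Toy data (hypothetical): 3 samples, each scored under 3 equivalent prompts.
scores = [[0.9, 0.7, 0.8], [0.2, 0.6, 0.1], [0.4, 0.5, 0.6]]
labels = [1, 0, 1]

ensemble = [mean_ensemble(s) for s in scores]
fragility = [cross_prompt_variance(s) for s in scores]
print("ensemble NLL:", binary_nll(ensemble, labels))
print("ensemble ECE:", expected_calibration_error(ensemble, labels))
print("per-sample cross-prompt variance:", fragility)
```

In this sketch the second sample has the largest cross-prompt variance, which is the kind of signal the abstract describes as a fragility diagnostic. Nothing here uses labels to form the ensemble itself; labels enter only when computing NLL and ECE.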

Charles Weng, Dingwen Li, Alexander Martin • 2026

Related benchmarks

Task	Dataset	Result
Harmful Content Detection	UnsafeBench	AUPRC 71.7	61
Harmful Content Detection	HoliSafe-Bench	AUPRC 75.6	49
Safety Classification	UnsafeBench	AUROC 80.5	49
Safety Classification	HoliSafe-Bench	AUROC 0.783	49
Safety Classification	UnsafeBench	ECE 0.061	21
Safety Classification	HoliSafe-Bench	ECE 8.4	21
