Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
About
Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and large-scale data filtering. Yet they may reward images that look plausible or prototypical rather than images that faithfully satisfy the prompt. We identify prototypicality bias as a systematic blindspot in multimodal evaluation: metrics can prefer a semantically incorrect but visually or socially prototypical image over a correct but less prototypical one. We introduce PROTOBIAS, a controlled diagnostic benchmark across Animals, Objects, and Demography, where semantically correct images are contrasted with plausible prototypical adversaries containing a single controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS is constructed with multiple prompt generators, image generators, and independent VLM filters, and validated through prompt-quality, human-annotation, and image-quality controls. Using PROTOBIAS, we show that widely used embedding, reward, VQA-based, and VLM-as-judge metrics frequently fail these contrasts, while human judgments remain more faithful to semantic correctness. We further introduce PROTOSCORE, a lightweight contrastively trained evaluator, as an initial mitigation baseline. PROTOBIAS provides a focused benchmark for measuring prototypicality-driven metric failures and developing more semantically faithful T2I evaluators.
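The contrast described above can be made concrete with a small sketch. Each benchmark item pairs a prompt with a semantically correct image and a prototypical adversary, and a metric fails the item when it scores the adversary at least as high as the correct image. The code below is an illustrative assumption rather than the authors' implementation: `ContrastPair`, `pairwise_accuracy`, and `mean_margin` are hypothetical names, the scores are placeholders, and the margin here is simply the mean score difference, which may differ from the paper's definition of its ranking margin.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ContrastPair:
    """One contrast item (hypothetical structure, not from the paper)."""
    prompt: str
    score_correct: float    # metric score for the semantically correct image
    score_adversary: float  # metric score for the prototypical adversary


def pairwise_accuracy(pairs: Sequence[ContrastPair]) -> float:
    """Fraction of pairs where the metric strictly prefers the correct image."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    wins = sum(p.score_correct > p.score_adversary for p in pairs)
    return wins / len(pairs)


def mean_margin(pairs: Sequence[ContrastPair]) -> float:
    """Average of (correct score - adversary score); negative values mean
    the metric favors the adversary on average. Assumed definition."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(p.score_correct - p.score_adversary for p in pairs) / len(pairs)


# Placeholder prompts and scores, invented purely to show the call pattern.
example_pairs = [
    ContrastPair("a blue banana on a table", 0.61, 0.74),
    ContrastPair("a zebra without stripes", 0.58, 0.52),
]
print(pairwise_accuracy(example_pairs), mean_margin(example_pairs))
```

In practice the two scores per pair would come from whichever metric is under test (an embedding similarity, a reward model, or a VLM judge), applied to the same prompt with each of the two images.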
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image-text alignment | ProtoBias Demography | SC: 75 | 7 |
| Image-text alignment | ProtoBias Animals (300 human annotated samples) | SC Score: 0.83 | 7 |
| Image-text alignment | ProtoBias Objects (300 human annotated samples) | SC Score: 0.89 | 7 |
| Prototypicality Bias Evaluation | ProtoBias (Animals) | Correct Ranking Margin: 0.361 | 6 |
| Prototypicality Bias Evaluation | ProtoBias Demography | Correct Ranking Margin: 0.358 | 6 |
| Prototypicality Bias Evaluation | ProtoBias Objects | Correct Ranking Margin: 0.346 | 6 |