
FairJudge: Abstention-Aware Multimodal Judges for Fairness and Alignment Evaluation in Text-to-Image Models

About

Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automated evaluators typically rely on face-centric recognizers or contrastive image-text similarity, which provide limited diagnostic feedback and often force predictions even when visual evidence is ambiguous or absent. For fairness-sensitive attributes such as religion and disability, where cues may be contextual, indirect, or intentionally unspecified, these evaluators can therefore miss failure modes that careful human reviewers would notice. We introduce FairJudge, an abstention-aware evaluation protocol that uses instruction-following multimodal LLMs as structured judges for social-attribute prediction, profession grounding, and prompt-image alignment. The protocol constrains outputs to closed label sets, requires visible-evidence rationales, supports an explicit "unspecified" decision when cues are insufficient, and maps rubric-based alignment judgments to [-1, 1]. These constraints turn MLLM judging from open-ended assessment into a parseable, auditable evaluation procedure. Across four attribute-prediction benchmarks and three profession/alignment benchmarks, FairJudge outperforms or complements CLIP, DeepFace, VIEScore, and VQAScore. Ablations show that closed labels, abstention, and evidence reporting are central to reliability. We further introduce DIVERSIFY and DIVERSIFY-Professions, two context-rich resources for evaluating social representation and profession grounding beyond face-visible or iconic cues. We release code, prompts, datasets, parser logs, and per-image judge outputs to support reproducible auditing.

Zahraa Al Sahili, Maimuna Nowaz, Maryam Fetanat, Ioannis Patras, Matthew Purver • 2025
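
The constraints named in the abstract (closed label sets, required visible-evidence rationales, an explicit "unspecified" option, and alignment scores mapped to [-1, 1]) can be illustrated with a minimal sketch. This is not the authors' released code: the label set, the 1-to-5 rubric range, the linear mapping, and the names `parse_judgment` and `rubric_to_alignment` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical closed label set; the paper's actual labels are not listed here.
GENDER_LABELS = frozenset({"female", "male", "unspecified"})

# Hypothetical rubric range; the paper's actual rubric is not described here.
RUBRIC_MIN, RUBRIC_MAX = 1, 5


@dataclass(frozen=True)
class Judgment:
    label: str     # one of the closed labels
    evidence: str  # visible-evidence rationale (may be empty when abstaining)


def parse_judgment(raw_label: str, evidence: str,
                   allowed: frozenset = GENDER_LABELS) -> Judgment:
    """Validate one judge output against a closed label set."""
    label = raw_label.strip().lower()
    if label not in allowed:
        # Open-ended answers are rejected rather than silently coerced.
        raise ValueError(f"label {label!r} is outside the closed set")
    evidence = evidence.strip()
    if label != "unspecified" and not evidence:
        # Assumption: a non-abstaining label must cite visible evidence.
        raise ValueError("a concrete label requires an evidence rationale")
    return Judgment(label, evidence)


def rubric_to_alignment(score: int) -> float:
    """Linearly map a rubric score in [RUBRIC_MIN, RUBRIC_MAX] to [-1, 1]."""
    if not RUBRIC_MIN <= score <= RUBRIC_MAX:
        raise ValueError(f"rubric score {score} is out of range")
    return 2 * (score - RUBRIC_MIN) / (RUBRIC_MAX - RUBRIC_MIN) - 1


if __name__ == "__main__":
    print(parse_judgment("unspecified", ""))   # abstention needs no evidence
    print(rubric_to_alignment(4))
```

Rejecting out-of-set labels and evidence-free predictions at parse time is one way to keep judge outputs auditable; a real pipeline might instead log such cases and retry the judge.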

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Prompt-image Alignment | DIV-Prof | Alignment Score | 81 | 6 |
| Prompt-image Alignment | FairCoT-Prof | Alignment Score | 71 | 6 |
| Prompt-image Alignment | IdenProf | Alignment Score | 70.9 | 6 |
| Social-attribute prediction | FairFace | Gender Accuracy | 97 | 5 |
| Social-attribute prediction | PaTA | Gender Accuracy | 99 | 5 |
| Social-attribute prediction | FairCoT | Gender Accuracy | 99 | 5 |
| Social-attribute prediction | FairCoT | Macro-F1 (Gender) | 99 | 5 |
| Profession Prediction | DIVERSIFY Professions | GMF1 (Age) | 64 | 4 |
| Profession Prediction | FC-Prof (FAIRCOT-PROFESSIONS) | GMF1 Score (Age) | 56 | 4 |
| Social-attribute prediction | DIVERSIFY | Gender Accuracy | 99.2 | 4 |
Showing 10 of 12 rows
