FairJudge: Abstention-Aware Multimodal Judges for Fairness and Alignment Evaluation in Text-to-Image Models

About

Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automated evaluators typically rely on face-centric recognizers or contrastive image-text similarity, which provide limited diagnostic feedback and often force predictions even when visual evidence is ambiguous or absent. For fairness-sensitive attributes such as religion and disability, where cues may be contextual, indirect, or intentionally unspecified, these evaluators can therefore miss failure modes that careful human reviewers would notice. We introduce FairJudge, an abstention-aware evaluation protocol that uses instruction-following multimodal LLMs as structured judges for social-attribute prediction, profession grounding, and prompt-image alignment. The protocol constrains outputs to closed label sets, requires visible-evidence rationales, supports an explicit "unspecified" decision when cues are insufficient, and maps rubric-based alignment judgments to [-1, 1]. These constraints turn MLLM judging from open-ended assessment into a parseable, auditable evaluation procedure. Across four attribute-prediction benchmarks and three profession/alignment benchmarks, FairJudge outperforms or complements CLIP, DeepFace, VIEScore, and VQAScore. Ablations show that closed labels, abstention, and evidence reporting are central to reliability. We further introduce DIVERSIFY and DIVERSIFY-Professions, two context-rich resources for evaluating social representation and profession grounding beyond face-visible or iconic cues. We release code, prompts, datasets, parser logs, and per-image judge outputs to support reproducible auditing.
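The output constraints described in the abstract (closed label set, visible-evidence rationale, explicit abstention, and a rubric mapped to [-1, 1]) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the released FairJudge code: the JSON reply format, the `GENDER_LABELS` set, the fallback-to-abstention policy, and the 0-to-4 rubric scale are all hypothetical.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical closed label set; the paper's actual labels may differ.
GENDER_LABELS = {"male", "female", "unspecified"}


@dataclass
class Judgment:
    label: str     # always a member of the closed label set
    evidence: str  # short description of the visible cue, possibly empty


def parse_judgment(raw: str, allowed: set[str]) -> Judgment:
    """Parse a judge reply assumed to be JSON with 'label' and 'evidence'.

    Assumed policy: any malformed reply, out-of-set label, or missing
    evidence is converted to an abstention ("unspecified").
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Judgment("unspecified", "")
    if not isinstance(data, dict):
        return Judgment("unspecified", "")
    label = str(data.get("label", "")).strip().lower()
    evidence = str(data.get("evidence", "")).strip()
    if label not in allowed or not evidence:
        return Judgment("unspecified", evidence)
    return Judgment(label, evidence)


def rubric_to_score(level: int, max_level: int = 4) -> float:
    """Linearly map a rubric level in 0..max_level onto [-1, 1].

    The linear mapping and the default scale are assumptions.
    """
    if not 0 <= level <= max_level:
        raise ValueError(f"rubric level {level} outside 0..{max_level}")
    return 2.0 * level / max_level - 1.0
```

For example, `parse_judgment('{"label": "Female", "evidence": "visible name badge"}', GENDER_LABELS)` yields the label `female`, while a free-text reply that is not valid JSON is recorded as `unspecified` rather than being forced into a class.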

Zahraa Al Sahili, Maimuna Nowaz, Maryam Fetanat, Ioannis Patras, Matthew Purver • 2025

Related benchmarks

Task	Dataset	Metric	Result
Prompt-image Alignment	DIV-Prof	Alignment Score	81	6
Prompt-image Alignment	FairCoT-Prof	Alignment Score	71	6
Prompt-image Alignment	IdenProf	Alignment Score	70.9	6
Social-attribute prediction	FairFace	Gender Accuracy	97	5
Social-attribute prediction	PaTA	Gender Accuracy	99	5
Social-attribute prediction	FairCoT	Gender Accuracy	99	5
Social-attribute prediction	FairCoT	Macro-F1 (Gender)	99	5
Profession Prediction	DIVERSIFY Professions	GMF1 (Age)	64	4
Profession Prediction	FC-Prof (FAIRCOT-PROFESSIONS)	GMF1 Score (Age)	56	4
Social-attribute prediction	DIVERSIFY	Gender Accuracy	99.2	4

Showing 10 of 12 rows
