
CAFP: A Post-Processing Framework for Group Fairness via Counterfactual Model Averaging

About

Ensuring fairness in machine learning predictions is a critical challenge, especially when models are deployed in sensitive domains such as credit scoring, healthcare, and criminal justice. While many fairness interventions rely on data preprocessing or algorithmic constraints during training, these approaches often require full control over the model architecture and access to protected attribute information, which may not be feasible in real-world systems. In this paper, we propose Counterfactual Averaging for Fair Predictions (CAFP), a model-agnostic post-processing method that mitigates unfair influence from protected attributes without retraining or modifying the original classifier. CAFP operates by generating counterfactual versions of each input in which the sensitive attribute is flipped, and then averaging the model's predictions across factual and counterfactual instances. We provide a theoretical analysis of CAFP, showing that it eliminates direct dependence on the protected attribute, reduces mutual information between predictions and sensitive attributes, and provably bounds the distortion introduced relative to the original model. Under mild assumptions, we further show that CAFP achieves perfect demographic parity and reduces the equalized odds gap by at least half the average counterfactual bias.
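The averaging step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a binary sensitive attribute stored as a 0/1 feature column and a black-box `predict_proba`-style scoring function; the helper names are hypothetical.

```python
import numpy as np

def cafp_predict(predict_proba, X, sensitive_col):
    """CAFP sketch: average a model's scores over the factual input and a
    counterfactual copy in which the binary sensitive attribute is flipped.

    predict_proba : callable mapping an (n, d) array to n scores
    X             : (n, d) feature matrix
    sensitive_col : index of the 0/1 protected-attribute column (assumed binary)
    """
    X = np.asarray(X, dtype=float)
    X_cf = X.copy()
    X_cf[:, sensitive_col] = 1.0 - X_cf[:, sensitive_col]  # flip the attribute
    # Averaging removes any direct dependence of the score on that column.
    return 0.5 * (predict_proba(X) + predict_proba(X_cf))

# Toy classifier whose score depends directly on the sensitive attribute (col 0).
def biased_model(X):
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] + 0.5 * X[:, 1])))

# Two individuals identical except for group membership.
X = np.array([[0.0, 1.0],
              [1.0, 1.0]])
scores = cafp_predict(biased_model, X, sensitive_col=0)
```

After averaging, both rows receive the same score, since each individual's factual and counterfactual inputs form the same pair; the classifier itself is never retrained or modified.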

Irina Arévalo, Marcos Oliva • 2026

Related benchmarks

| Task                           | Dataset              | Result           | Rank |
|--------------------------------|----------------------|------------------|------|
| Classification                 | German Credit (test) | Accuracy 75.41   | 28   |
| Binary Classification          | COMPAS               | Accuracy 66.53   | 21   |
| Classification Fairness        | Adult                | AOD -0.0791      | 12   |
| Fair Classification            | COMPAS               | AOD -0.1305      | 12   |
| Fairness Mitigation            | German Credit        | AOD -0.0324      | 12   |
| Fairness Mitigation Evaluation | German Credit (test) | DPD -0.0505      | 12   |
| Fairness Classification        | Adult                | DPD -0.1837      | 12   |
