C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

About

Bias in Large Language Models (LLMs) poses significant risks to trustworthiness. It manifests primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). Prior approaches typically address these in isolation, often mitigating one while exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary cause: latent spurious feature correlations in the input that drive erroneous reasoning shortcuts. Building on these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that simultaneously discovers and suppresses these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, UnQover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.
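The abstract does not spell out the C2PO objective, so the following is only a rough illustration of the general idea of combining a preference loss with a counterfactual signal. It uses the standard DPO-style loss on a chosen/rejected pair and adds a hypothetical penalty on how much the preference margin changes when a bias-related feature in the input is swapped (for example, a gendered name). The function names, the absolute-difference penalty, and the `lam` weight are assumptions made for this sketch, not the authors' formulation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_margin(policy_chosen: float, policy_rejected: float,
                      ref_chosen: float, ref_rejected: float) -> float:
    """Implicit reward margin of the standard DPO objective.

    Each argument is a sequence log-probability under the policy or the
    frozen reference model.
    """
    return (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair."""
    return -log_sigmoid(beta * margin)


def counterfactual_penalized_loss(margin_orig: float, margin_cf: float,
                                  beta: float = 0.1, lam: float = 1.0) -> float:
    """Hypothetical combination (an assumption, not the paper's loss).

    margin_orig: margin on the original prompt.
    margin_cf:   margin on a counterfactual prompt in which only the
                 bias-related feature was changed.
    If the feature is irrelevant to the answer, the two margins should
    agree, so their gap is penalized.
    """
    gap = abs(margin_orig - margin_cf)
    return dpo_loss(margin_orig, beta) + lam * gap


# Usage with made-up log-probabilities.
m_orig = preference_margin(-1.0, -3.0, -2.0, -2.5)
m_cf = preference_margin(-1.2, -2.0, -2.0, -2.5)
loss = counterfactual_penalized_loss(m_orig, m_cf, beta=0.1, lam=0.5)
```

In this sketch the penalty vanishes when the model's preference is invariant to the swapped feature, which is one simple way to express "suppressing a shortcut" inside a preference objective.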

Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, Shuai Zhao • 2025

Related benchmarks

Task	Dataset	Result
Bias Evaluation	BBQ	Accuracy 99.3	175
Natural Language Inference	MNLI	Accuracy 65.9	36
General Utility Evaluation	MT_Bench	Agreement Rate 82.7	33
Natural Language Inference	HANS	Accuracy 99.6	23
General Utility Evaluation	Chatbot	Agree Score 80	14
Out-of-Domain (OOD) Bias Evaluation	StereoSet	Accuracy 67.2	14
Structural Bias Evaluation	MNLI	Accuracy 98.1	14
Structural Bias Evaluation	HANS	Accuracy 99.6	14
Stereotypical Bias Mitigation	UNQOVER	Accuracy 99.9	14
Out-of-Domain (OOD) Bias Evaluation	Winobias	Accuracy 0.501	14

Showing 10 of 13 rows
