Self-Debias: Self-correcting for Debiasing Large Language Models
About
Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade through the Chain-of-Thought (CoT) process, leading to persistent "Bias Propagation". Existing debiasing methods primarily rely on static constraints or external interventions and fail to identify and interrupt this propagation once it is triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate debiasing as a strategic resource-redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization, which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism that uses consistency filtering to autonomously synthesize supervision signals. With only 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities, without continuous external oversight.
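The consistency-filtering step described above can be illustrated with a minimal sketch. The function name, the agreement threshold, and the majority-vote criterion are illustrative assumptions, not the paper's exact procedure: the idea is that among several sampled (reasoning, answer) pairs for the same prompt, only trajectories whose final answer agrees with a sufficiently strong majority are kept as synthesized supervision signals.

```python
from collections import Counter

def consistency_filter(samples, min_agreement=0.6):
    """Hypothetical self-consistency filter.

    samples: list of (reasoning_trace, final_answer) pairs sampled
    from the model for one prompt. Keeps only traces whose answer
    matches the majority vote, and discards the whole batch when
    the majority is weaker than `min_agreement`.
    """
    answers = [ans for _, ans in samples]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(samples) < min_agreement:
        return top_answer, []  # low consensus: no supervision signal
    kept = [(trace, ans) for trace, ans in samples if ans == top_answer]
    return top_answer, kept

# Toy usage: 3 of 4 sampled trajectories agree, so they are kept.
samples = [
    ("reasoning path A", "unbiased"),
    ("reasoning path B", "unbiased"),
    ("reasoning path C", "biased"),
    ("reasoning path D", "unbiased"),
]
answer, kept = consistency_filter(samples)
```

The threshold trades recall for label quality: a higher `min_agreement` yields fewer but cleaner self-generated training pairs.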
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | ARC-Challenge | Accuracy 93.8 | 190 |
| Bias Evaluation | BBQ | -- | 113 |
| Fairness Evaluation | UnQover | Score 99.6 | 16 |
| Fairness Evaluation | CEB-Adult | Score 68.3 | 16 |
| Fairness Evaluation | CEB-Credit | Score 65.8 | 16 |
| Fairness Evaluation | CEB-Jigsaw | Score 73.5 | 16 |
| Fairness Evaluation | CrowS-Pairs | Score 72.2 | 16 |
| Multi-task Evaluation | Fairness and Utility Suite | Average Score 82.1 | 16 |
| Fairness and Utility Evaluation | Fairness and Utility Benchmarks (BBQ, UnQover, CEB-Adult, CEB-Credit, CEB-Jigsaw, CrowS-Pairs, ARC-C, GSM8K) | BBQ Accuracy 97.1 | 8 |