
KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models

About

Large language models (LLMs) often exhibit societal biases in their outputs, raising ethical concerns about fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective that combines cross-entropy, KL divergence, and triplet losses, guiding the model to attend consistently across biased and unbiased contexts while preserving fluency and coherence. Experiments show that KLAAD improves bias mitigation on both the BBQ and BOLD benchmarks with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled approach to mitigating bias in generative language models.
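The composite objective described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract names only the three loss terms, so everything else here is an assumption: the weights lambda_kl and lambda_triplet, the margin, the direction of the KL term, the choice of Euclidean distance, and which representations serve as anchor, positive, and negative are all hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(max(probs[target], 1e-12))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss with Euclidean distance (assumed metric)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def composite_loss(probs, target, attn_stereo, attn_anti,
                   anchor, positive, negative,
                   lambda_kl=1.0, lambda_triplet=1.0, margin=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    ce = cross_entropy(probs, target)
    # Assumed direction: pull the stereotypical attention distribution
    # toward the anti-stereotypical one.
    kl = kl_divergence(attn_stereo, attn_anti)
    tri = triplet_loss(anchor, positive, negative, margin)
    return ce + lambda_kl * kl + lambda_triplet * tri


# Toy inputs: one next-token distribution, two attention rows over four
# tokens, and three 2-d vectors standing in for hidden representations.
loss = composite_loss(
    probs=[0.1, 0.7, 0.2], target=1,
    attn_stereo=[0.6, 0.2, 0.1, 0.1],
    attn_anti=[0.3, 0.3, 0.2, 0.2],
    anchor=[0.0, 0.0], positive=[0.5, 0.0], negative=[2.0, 0.0],
)
print(loss)
```

When the two attention distributions coincide, the KL term is zero, so in this sketch the alignment penalty disappears once the model attends identically to both sentences of a pair.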

Seorin Kim, Dongyoung Lee, Jaejin Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Utility Evaluation | Anchor Utility Dataset | Anchor-PPL | 20.36 | 16 |
| Debiasing Effectiveness | Out-of-Distribution (OOD) Split | Mean Ratio | 1.61 | 16 |
| Mechanism Analysis | Model Internal Representations | Edge Delta Specification | 0.1572 | 16 |
| Debiasing Effectiveness | In-Distribution (ID) | Mean Effectiveness Score (ID) | 1.08 | 16 |
| Safety Evaluation | Anchor Safety Dataset | Anchor Accuracy | 100 | 16 |
| Bias Evaluation | HolisticBias | -- | -- | 10 |
| Large Language Model Debiasing | BBQ and CrowS-Pairs Out-of-Distribution (test) | Mean Bias | 0.98 | 9 |
| Large Language Model Debiasing | BBQ and CrowS-Pairs In-Distribution (test) | Mean Bias | 1.03 | 9 |
| Bias Evaluation | BBQ Gender | Ambiguity Score | 47.2 | 4 |
| Bias Evaluation | BoLD | Bias Score | 1.267 | 4 |
Showing 10 of 13 rows
