Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

About

Instruction-following language models often show undesirable biases. These undesirable biases may be accelerated in the real-world usage of language models, where a wide range of instructions is used through zero-shot example prompting. To solve this problem, we first define the bias neuron, which significantly affects biased outputs, and prove its existence empirically. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge. The experimental results reveal the generalizability of our method as it shows robustness under various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (at least three).

Nakyeong Yang, Taegwan Kang, Jungkyu Choi, Honglak Lee, Kyomin Jung • 2023
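
For intuition, the sketch below illustrates one way a bias-neuron elimination of the kind described in the abstract can be implemented: score each inner feed-forward neuron by an attribution toward a biased output, take the top-scoring neurons, and zero their outgoing weights. This is a minimal sketch under stated assumptions: the gradient-times-activation score, the toy feed-forward block, and the `attribution_scores` helper are illustrative choices, not the paper's method; CRISPR's actual explainability technique, its automatic determination of biased outputs, and its neuron-selection rule are not specified in this abstract.

```python
# Hypothetical sketch of bias-neuron elimination via attribution scores.
# Assumption: gradient-x-activation attribution on a toy feed-forward block;
# the paper's actual scoring and selection procedure may differ.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one feed-forward block of an instruction-following LM.
hidden, inner = 16, 64
ffn = nn.Sequential(nn.Linear(hidden, inner), nn.ReLU(), nn.Linear(inner, hidden))

def attribution_scores(x: torch.Tensor, biased_logit_dir: torch.Tensor) -> torch.Tensor:
    """Score each inner neuron by |activation * d(biased output)/d(activation)|."""
    acts = ffn[1](ffn[0](x))                       # inner activations, shape (batch, inner)
    acts.retain_grad()                             # keep gradients for this non-leaf tensor
    out = ffn[2](acts)                             # block output, shape (batch, hidden)
    biased_score = (out * biased_logit_dir).sum()  # proxy for the biased output's logit
    biased_score.backward()
    return (acts * acts.grad).abs().sum(dim=0)     # aggregate attribution over the batch

# Fake "biased" inputs and a direction standing in for the biased answer's logit.
x = torch.randn(8, hidden)
bias_dir = torch.randn(hidden)

scores = attribution_scores(x, bias_dir)
k = 3                                              # the abstract notes only a few neurons may suffice
bias_neurons = scores.topk(k).indices

# "Eliminate" the selected neurons by zeroing their outgoing weights.
with torch.no_grad():
    ffn[2].weight[:, bias_neurons] = 0.0

print("pruned inner neurons:", bias_neurons.tolist())
```

The choice of k = 3 only mirrors the abstract's observation that eliminating a handful of neurons can already mitigate bias; in practice the number and location of bias neurons would be determined by the method itself.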

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | ARC Challenge | Accuracy: 86.86 | 906 |
| Natural Language Understanding | GLUE (test) | -- | 416 |
| Question Answering | ARC Easy | Normalized Acc: 94.02 | 389 |
| Question Answering | OBQA | Accuracy: 71.96 | 300 |
| Question Answering | COPA | Accuracy: 83.3 | 59 |
| Question Answering | BBQ (Bias Benchmark for QA) v1.0 (test) | BBQ SES Score: 93.1 | 16 |
| Bias Mitigation | FairMT-Bench | Anaphora Ellipsis Score: 37 | 12 |
| Bias Mitigation | BBQ SingleTurn | Age Bias: 22.1 | 12 |
| Bias Mitigation | F^2-Bench | Accuracy (Age): 27.5 | 12 |
| Bias Mitigation | PCT SingleTurn | English PCT SingleTurn Score: 21.7 | 12 |