Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

About

When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we first demonstrate a surprising finding: pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. We refer to this capability as self-diagnosis. Based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. We refer to this approach as self-debiasing. Self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction.
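The abstract does not spell out the decoding rule, so the following is only a minimal sketch of how such a self-debiasing step could look. It assumes the model is queried twice: once with the original prompt, and once with the prompt preceded by a textual description of the undesired behavior. Tokens that become more likely under the second prompt are then down-weighted. The function name `self_debias`, the `decay` constant, the exponential form of the scaling, and the toy probabilities are illustrative assumptions, not taken from this page.

```python
import math


def self_debias(p_plain, p_described, decay=10.0):
    """Rescale a next-token distribution (illustrative sketch only).

    p_plain:     dict token -> probability given the original prompt
    p_described: dict token -> probability given the prompt preceded by a
                 textual description of the undesired behavior
    decay:       hypothetical strength constant; larger values penalize more
    """
    scaled = {}
    for token, p in p_plain.items():
        # Negative delta: the description makes this token MORE likely,
        # which is treated here as a sign that the token is undesired.
        delta = p - p_described.get(token, 0.0)
        # Assumed scaling: leave other tokens alone, shrink flagged ones.
        alpha = 1.0 if delta >= 0 else math.exp(decay * delta)
        scaled[token] = alpha * p
    total = sum(scaled.values())
    # Renormalize so the result is a probability distribution again.
    return {token: value / total for token, value in scaled.items()}


# Toy distributions (made up for illustration).
p_plain = {"kind": 0.5, "rude": 0.3, "tall": 0.2}
p_described = {"kind": 0.2, "rude": 0.7, "tall": 0.1}
debiased = self_debias(p_plain, p_described, decay=10.0)
```

In this toy case only "rude" gains probability under the description prompt, so it is the only token that is scaled down; the remaining mass is redistributed to the others by renormalization. No word list, training data, or parameter update is involved, which matches the properties claimed in the abstract.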

Timo Schick, Sahana Udupa, Hinrich Schütze • 2021

Related benchmarks

Task	Dataset	Result
Math Reasoning	GSM8K	Accuracy 50.8	254
Bias Evaluation	BBQ	--	175
General Knowledge Evaluation	MMLU	MMLU Accuracy 41	167
Language Modeling	WikiText-2	Perplexity (PPL) 14.1	146
Memory-based Bias Reduction	Bias Reduction Benchmark Memory	Bias Reduction Performance 43.2	35
Evaluation-based Bias Reduction	Bias Reduction Benchmark (Evaluation)	Bias Reduction Performance 69.8	35
Memory Fidelity Evaluation	Memory-based Experiment Seen Features	P-Diff 0.1	32
Toxicity Evaluation	BoLD	--	26
Bias Measurement	StereoSet	Overall SS 59.34	25
Commonsense Reasoning	StrategyQA	Accuracy (%) 73	24

Showing 10 of 32 rows
