The Capacity for Moral Self-Correction in Large Language Models

About

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, Jared Kaplan • 2023

Related benchmarks

Task	Dataset	Metric	Result
Safety Evaluation	DoNotAnswer Framed	HRR	0.449	96
In-Context Value Alignment	Value Composition (Overall)	Confucianism Score	3.995	37
Value Alignment	Confucianism-4	Conformity Score	3.842	22
Value Alignment	HH Balance-8	Conformity Score	4.204	17
Value Alignment	Helpfulness 4	Conformity Score	4.364	16
Value Alignment	Harmlessness 4	Conformity Score	4.083	16
Value Alignment	Liberalism 4	Conformity Score	3.167	11
In-Context Value Alignment	VALUE PORTRAIT Liberalism-4 (OOD)	Conformity Score	3.184	5
Human Evaluation of Value Alignment	Value Composition Human Study	Confucianism Score	2.928	5
