Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

About

We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thought not only achieves superior performance on three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and make interactive error correction easier. Extensive human evaluations further confirm its effectiveness.
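
To make the "chain-of-edits" idea concrete, here is a minimal sketch of how word-level edit steps between a value-unaligned sentence and a value-aligned one could be extracted. It is an illustration only: it assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher`, and the function name `chain_of_edits` and the example sentences are hypothetical. The paper's own edit inference and training format may differ.

```python
from difflib import SequenceMatcher


def chain_of_edits(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return word-level edit steps that turn `source` into `target`.

    Each step is (operation, removed_text, added_text), where operation is
    one of "replace", "delete", or "insert". Unchanged spans are skipped.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)

    steps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # nothing to edit in this span
        removed = " ".join(src_tokens[i1:i2])
        added = " ".join(tgt_tokens[j1:j2])
        steps.append((tag, removed, added))
    return steps


if __name__ == "__main__":
    # Made-up sentence pair, used only to show the output format.
    for step in chain_of_edits("I will ignore him", "I will help him"):
        print(step)
```

A sequence of such steps, rather than only the final aligned text, is the kind of intermediate output that could be inspected or corrected by hand.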

Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X Liu, Soroush Vosoughi • 2023

Related benchmarks

Task	Dataset	Result
Value Alignment	Moral Stories (test)	Align Score 4.85	10
Value Alignment	MIC (test)	Align Score 5.48	10
Value Alignment	Ethics (test)	Align Score 5.57	10
Dialogue Generation	Movie Dic (test)	ROUGE-L 17.35	5
Dialogue Generation	DSTC-8 Reddit (test)	R-L Score 12.56	5
Text Generation	Cornell IMDB (test)	ROUGE-L 22.47	5

