Debating with More Persuasive LLMs Leads to More Truthful Answers

About

Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
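The debate setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `run_debate`, `judge_accuracy`, `stub_debater`, and `stub_judge` are hypothetical, the number of rounds is an assumed parameter, and the stubs stand in for LLM calls. In the setting the abstract describes, the expert debaters would have access to information that the non-expert judge lacks, which the stubs do not model.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a debater maps (question, assigned answer, transcript so far) to an argument;
# a judge maps (question, both answers, full transcript) to the index it picks.
Debater = Callable[[str, str, List[str]], str]
Judge = Callable[[str, Tuple[str, str], List[str]], int]


def run_debate(
    question: str,
    answers: Tuple[str, str],
    debater_a: Debater,
    debater_b: Debater,
    judge: Judge,
    rounds: int = 3,  # assumed default, chosen for illustration only
) -> Tuple[int, List[str]]:
    """Alternate arguments for each answer, then let the judge choose one."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, answers[0], transcript))
        transcript.append("B: " + debater_b(question, answers[1], transcript))
    return judge(question, answers, transcript), transcript


def judge_accuracy(verdicts: List[int], correct: List[int]) -> float:
    """Fraction of debates in which the judge picked the correct answer."""
    return sum(v == c for v, c in zip(verdicts, correct)) / len(correct)


def stub_debater(question: str, answer: str, transcript: List[str]) -> str:
    # Placeholder for an expert LLM arguing for its assigned answer.
    return f"The passage supports '{answer}'."


def stub_judge(question: str, answers: Tuple[str, str], transcript: List[str]) -> int:
    # Placeholder for a non-expert judge: picks the answer mentioned in more
    # transcript lines, breaking ties in favour of the first answer.
    counts = [sum(a in line for line in transcript) for a in answers]
    return 0 if counts[0] >= counts[1] else 1
```

Calling `run_debate("Who wrote the letter?", ("Anna", "Boris"), stub_debater, stub_debater, stub_judge)` returns the judge's chosen index together with the transcript; replacing the stubs with model-backed callables keeps the same control flow.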

Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez • 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Top-1 Accuracy 80.15	452
Arithmetic Reasoning	MultiArith	Accuracy 96.27	324
Mathematical Reasoning	GSM8K	--	220
Math Reasoning	AQUA	Accuracy 77.52	194
General Reasoning	BBH	Accuracy 86.2	190
Language Understanding	MMLU	MMLU Accuracy 83.69	144
Long-context Reasoning	LongBench	Accuracy (LongBench) 65.2	101
Language Understanding	MMLU CF	Score 73	66
Reasoning	MMLU	Accuracy 83.7	57
Reasoning	GSM8K	Accuracy (GSM8K) 90.2	55

