Self-Discover: Large Language Models Self-Compose Reasoning Structures

About

We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.

Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng • 2024
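
The self-discovery process described in the abstract (select atomic reasoning modules, compose them into an explicit structure, then follow that structure when answering) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`select_modules`, `compose_structure`, `solve`), the JSON format of the structure, and the third entry in `REASONING_MODULES` are all assumptions made for the example, and `fake_llm` is a deterministic stand-in so the sketch runs without any model API.

```python
import json
from typing import Callable, Dict, List

# Any text-in, text-out model call can be plugged in here (assumption).
LLM = Callable[[str], str]

# The abstract names the first two modules; the third is a made-up extra.
REASONING_MODULES = [
    "critical thinking",
    "step-by-step thinking",
    "break the problem into sub-problems",
]


def select_modules(llm: LLM, tasks: List[str], modules: List[str]) -> List[str]:
    """Ask the model which of the offered modules suit the task."""
    prompt = (
        "Select the reasoning modules useful for these tasks. "
        "Answer with a JSON list of module names.\n"
        f"Modules: {json.dumps(modules)}\nTasks: {json.dumps(tasks)}"
    )
    chosen = json.loads(llm(prompt))
    # Keep only names that were actually offered, in their original order.
    return [m for m in modules if m in chosen]


def compose_structure(llm: LLM, selected: List[str], tasks: List[str]) -> Dict[str, str]:
    """Ask the model to turn the selected modules into an explicit plan."""
    prompt = (
        "Compose these modules into a reasoning structure. Answer with a "
        "JSON object mapping each step to an empty string.\n"
        f"Modules: {json.dumps(selected)}\nTasks: {json.dumps(tasks)}"
    )
    return json.loads(llm(prompt))


def solve(llm: LLM, structure: Dict[str, str], question: str) -> str:
    """Have the model follow the composed structure for one question."""
    prompt = (
        "Follow this reasoning structure, fill in each step, then answer.\n"
        f"{json.dumps(structure, indent=2)}\nQuestion: {question}"
    )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Deterministic stub standing in for a real model (illustration only)."""
    if prompt.startswith("Select"):
        return json.dumps(["step-by-step thinking", "critical thinking"])
    if prompt.startswith("Compose"):
        return json.dumps({
            "1. Restate the problem": "",
            "2. Work through it step by step": "",
            "3. Check the result critically": "",
        })
    return "final answer: (stub)"


if __name__ == "__main__":
    tasks = ["If a train travels 60 km in 1.5 hours, what is its speed?"]
    selected = select_modules(fake_llm, tasks, REASONING_MODULES)
    structure = compose_structure(fake_llm, selected, tasks)
    print(solve(fake_llm, structure, tasks[0]))
```

In this sketch the structure is composed once and can be reused for every question of the same task, so each question costs a single model call in `solve`, in contrast to sampling many reasoning paths per question.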

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	SVAMP	Accuracy 17.33	403
Commonsense Reasoning	CSQA	Accuracy 57.33	366
Mathematical Reasoning	GSM8K	Accuracy (GSM8K) 7.43	358
Math Reasoning	GSM8K (test)	Accuracy 56.33	250
General Reasoning	BBH	Accuracy 91	190
Question Answering	StrategyQA	Accuracy 85	123
Math	MATH 500	Accuracy 90.3	120
General Reasoning	BIG-Bench Hard	--	68
Mathematics	AIME 2025	Accuracy 48.3	66
Reasoning	BIG-Bench Hard (BBH) (test)	Average Accuracy 56.9	62

Showing 10 of 29 rows
