Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

About

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou• 2022

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy55.89	1896
Commonsense Reasoning	WinoGrande	Accuracy63.6	1442
Mathematical Reasoning	GSM8K	Accuracy95.1	1398
Node Classification	Cora	Accuracy63.02	1215
Code Generation	HumanEval	Pass@189.84	1043
Mathematical Reasoning	GSM8K (test)	Accuracy95.2	954
Question Answering	ARC Challenge	Accuracy81.06	906
Mathematical Reasoning	MATH500 (test)	Accuracy90.5	895
Mathematical Reasoning	MATH	Accuracy85.4	882
Multi-task Language Understanding	MMLU	Accuracy78.43	881

Showing 10 of 2072 rows

...

Other info

Follow for update

@wizwand_team Discord