
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

About

We explore how generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain-of-thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
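The method described above amounts to prompt construction: each few-shot exemplar pairs a question with its worked intermediate reasoning steps before the final answer, and the unsolved question is appended last. The sketch below is illustrative only; the exemplar question follows the style of the paper's examples, and the function and variable names are our own, not from the paper.

```python
# Sketch of chain-of-thought prompting: few-shot exemplars that include
# intermediate reasoning steps, followed by the question to be solved.
# Exemplar content and helper names here are illustrative assumptions.
EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis "
            "balls. Each can has 3 tennis balls. How many tennis balls "
            "does he have now?"
        ),
        "chain_of_thought": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each "
            "is 6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
]

def build_cot_prompt(exemplars, new_question):
    """Concatenate worked exemplars, then append the unsolved question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['chain_of_thought']} The answer is {ex['answer']}.\n"
        )
    # The trailing "A:" invites the model to continue with its own
    # reasoning steps before stating an answer.
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)

prompt = build_cot_prompt(
    EXEMPLARS,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

The prompt string would then be sent to a sufficiently large language model; per the abstract, the reasoning behavior emerges with scale rather than from any finetuning.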

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 55.89 | 1460 |
| Mathematical Reasoning | GSM8K | Accuracy | 93.5 | 983 |
| Code Generation | HumanEval | Pass@1 | 89.84 | 850 |
| Multi-task Language Understanding | MMLU | Accuracy | 78.43 | 842 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 95.2 | 797 |
| Commonsense Reasoning | WinoGrande | Accuracy | 63.6 | 776 |
| Language Understanding | MMLU | Accuracy | 83.01 | 756 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 93 | 751 |
| Question Answering | ARC Challenge | Accuracy | 81.06 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 66.1 | 647 |
Showing 10 of 1,023 rows.
