
Automatic Chain of Thought Prompting in Large Language Models

About

Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the "Let's think step by step" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot

Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola · 2022
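The abstract above describes a two-stage recipe: cluster the task's questions so the sampled demonstrations are diverse, pick a representative question per cluster, and let the model generate each reasoning chain with the Zero-Shot-CoT prompt. A minimal sketch of that pipeline follows; the toy character-count embedding and the stubbed LLM call are placeholders (the paper uses Sentence-BERT embeddings and GPT-3), and the shortest-question heuristic stands in for the paper's fuller selection criteria.

```python
# Hedged sketch of the Auto-CoT pipeline: diversity-based sampling plus
# Zero-Shot-CoT rationale generation. Embedding and LLM are stand-ins.
import math
import random


def embed(question):
    """Toy bag-of-letters embedding (placeholder for Sentence-BERT)."""
    vec = [0.0] * 26
    for ch in question.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means; returns a cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def zero_shot_cot(question):
    """Stub for querying an LLM with the Zero-Shot-CoT prompt."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    # A real call would return the model's generated rationale here.
    return prompt + " <model-generated rationale>"


def auto_cot(questions, k):
    """Build k CoT demonstrations from diverse question clusters."""
    vectors = [embed(q) for q in questions]
    labels = kmeans(vectors, k)
    demos = []
    for c in range(k):
        members = [q for q, l in zip(questions, labels) if l == c]
        if not members:
            continue  # a cluster can come up empty with toy data
        # Heuristic stand-in: prefer a short question per cluster
        # (the paper also filters on rationale length and step count).
        rep = min(members, key=len)
        demos.append(zero_shot_cot(rep))
    return demos
```

The resulting demonstrations would then be concatenated in front of the test question, so the final prompt follows the few-shot CoT paradigm without any hand-written rationales.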

Related benchmarks

Task                          | Dataset          | Result                  | Rank
Mathematical Reasoning        | GSM8K            | Accuracy 71.4           | 1362
Visual Question Answering     | VQA v2           | Accuracy 66.79          | 1362
Visual Question Answering     | TextVQA          | Accuracy 63.64          | 1285
Mathematical Reasoning        | MATH (test)      | Overall Accuracy 72.78  | 433
Multi-hop Question Answering  | 2WikiMultihopQA  | --                      | 387
Visual Question Answering     | ScienceQA        | Accuracy 74.09          | 370
Commonsense Reasoning         | CSQA             | Accuracy 79.4           | 366
Visual Question Answering     | OK-VQA           | Accuracy 48.13          | 260
Multi-hop Question Answering  | HotpotQA (test)  | --                      | 255
Sentiment Classification      | SST2 (test)      | Accuracy 88.65          | 233
(10 of 62 benchmark rows shown)

Other info

Code: https://github.com/amazon-research/auto-cot