The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

About

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning compared with larger LMs when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning capability by instruction tuning with CoT rationales. To this end, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (including only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection gives smaller LMs better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT, prompted with as many demonstrations as fit within its maximum input length, by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.

Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo • 2023
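
To make the idea of CoT fine-tuning in the abstract concrete, the following is a minimal sketch of how one training instance could be turned into a source/target pair for an encoder-decoder model such as Flan-T5. It is an illustration only: the `CoTExample` class, the `ANSWER_PREFIX` separator, and both helper functions are hypothetical names chosen here, and the exact format used by the CoT Collection may differ.

```python
from dataclasses import dataclass


@dataclass
class CoTExample:
    """One training instance: an instruction, a rationale, and a final answer."""
    instruction: str
    rationale: str
    answer: str


# Hypothetical separator between rationale and answer (an assumption made for
# this sketch; it is not taken from the paper or its released data).
ANSWER_PREFIX = "So the answer is"


def to_seq2seq_pair(example: CoTExample) -> tuple[str, str]:
    """Return (source, target) strings for an encoder-decoder model.

    The target places the rationale first and the answer last, so the model
    is trained to generate its reasoning before committing to an answer.
    """
    source = example.instruction.strip()
    target = f"{example.rationale.strip()} {ANSWER_PREFIX} {example.answer.strip()}."
    return source, target


def extract_answer(generated: str) -> str:
    """Recover the final answer from generated text in the format above.

    If the separator is missing, the whole generation is returned unchanged
    (apart from stripping whitespace and a trailing period).
    """
    _, _, tail = generated.rpartition(ANSWER_PREFIX)
    return tail.strip().rstrip(".").strip()
```

Under these assumptions, the same separator serves both training and evaluation: the model is fine-tuned on targets built by `to_seq2seq_pair`, and accuracy is computed by passing its generations through `extract_answer`.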

Related benchmarks

Task	Dataset	Result
Multilingual Mathematical Reasoning	MGSM 1.0 (test)	Accuracy (ru) 10.4	35
Natural Language Inference	EntailmentBank (test)	BLEU 53	20
Reasoning and Classification	BBH (Big-Bench Hard) (unseen)	BBH Temporal Sequences 28.8	17
Hate speech classification	HateXplain (test)	Macro-F1 72	13
Hate speech classification	Implicit Hate (test)	Macro F1 0.64	13
Hate speech classification	Latent Hate (test)	Macro F1 Score 61	13
Reasoning	BBH (unseen)	Total Average Score 42.38	12
Explanatory Inference	EntailmentBank	BLEU 32	12
General Language Understanding	P3 v1 (unseen)	RTE Accuracy 80.79	11

