Large Language Models Can Self-Improve

About

Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han • 2022
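
The data-selection step described in the abstract (sample several Chain-of-Thought rationales per unlabeled question, take the self-consistency majority answer, and keep only "high-confidence" questions as fine-tuning targets) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names SampledPath, majority_answer, and build_training_examples, the min_confidence threshold, and the target text format are all assumptions introduced here, and the sampling and fine-tuning calls themselves are left out.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class SampledPath(NamedTuple):
    """One sampled reasoning path (hypothetical container, not from the paper)."""
    rationale: str  # chain-of-thought text sampled from the model
    answer: str     # final answer parsed from that rationale


def majority_answer(paths: List[SampledPath]) -> Tuple[Optional[str], float]:
    """Self-consistency vote: return the most frequent answer and its vote share."""
    if not paths:
        return None, 0.0
    counts = Counter(p.answer for p in paths)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(paths)


def build_training_examples(
    question: str,
    paths: List[SampledPath],
    min_confidence: float = 0.5,  # assumed threshold; the abstract gives no value
) -> List[Tuple[str, str]]:
    """Keep the rationales that agree with the majority answer, but only when
    the vote share reaches the (assumed) confidence threshold."""
    answer, confidence = majority_answer(paths)
    if answer is None or confidence < min_confidence:
        return []  # low-confidence question: contributes no training data
    return [
        (question, f"{p.rationale} The answer is {p.answer}.")
        for p in paths
        if p.answer == answer
    ]


# Toy usage with hand-written paths standing in for model samples.
paths = [
    SampledPath("16 - 3 - 4 = 9 eggs, and 9 * 2 = 18.", "18"),
    SampledPath("She sells 9 eggs at $2 each, so 18.", "18"),
    SampledPath("16 - 3 = 13, and 13 * 2 = 26.", "26"),
    SampledPath("9 eggs remain; 9 * 2 = 18.", "18"),
]
examples = build_training_examples("How much does she make per day?", paths)
```

In this toy run three of the four paths agree on the same answer, so the question passes the assumed threshold and its three agreeing rationales become (input, target) pairs; the disagreeing path is discarded. No ground-truth label is consulted at any point, which is the property the abstract emphasizes.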

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 82.1	1424
Mathematical Reasoning	GSM8K (test)	Accuracy 82.1	816
Question Answering	OpenBookQA	Accuracy 94.4	465
Science Question Answering	ARC-C	Accuracy 89.8	268
Reading Comprehension	DROP	DROP Accuracy 83	138
Natural Language Inference	ANLI Round 2	Accuracy 66.5	64
Natural Language Inference	ANLI Round 3	Accuracy 67.9	64
Scientific Reasoning	SciKnowEval	Chemistry Accuracy 73.72	56
Multi-modal Question Answering	MedXpertQA-MM	Accuracy 22.01	38
General Knowledge	MMLU	pass@1 71.74	31

Showing 10 of 19 rows
