Automatic Model Selection with Large Language Models for Reasoning

About

Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two distinct reasoning methods, each with its own strengths. CoT reasons in natural language, offering flexibility and interpretability, while PAL reasons in a programming language, yielding more structured and rigorous logic. We introduce a model selection method that combines the best of both worlds by employing a large language model (LLM) to dynamically select between them. Our theoretical analysis underscores the feasibility of this method, which is further corroborated by empirical results. Our proposed method demonstrates significant performance improvements across eight reasoning datasets with Codex, ChatGPT, and GPT-4. Additionally, our method is complementary to self-consistency; when integrated, it can further enhance performance while significantly reducing computation costs. Moreover, we achieve new state-of-the-art results on GSM8K and SVAMP, with respective accuracies of 96.8% and 93.7%. Our code, data, and prompts are available at https://github.com/XuZhao0/Model-Selection-Reasoning
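
The selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `select_answer`, `Solver`, and `Selector` are hypothetical, the solvers and selector are stand-in callables rather than real LLM calls, and skipping the selector when both methods agree is an assumption of this sketch.

```python
from typing import Callable, Tuple

# Hypothetical types, introduced only for this illustration.
# A solver maps a question to (reasoning_text, final_answer).
Solver = Callable[[str], Tuple[str, str]]
# A selector sees the question and both reasoning traces and returns "cot" or "pal".
Selector = Callable[[str, str, str], str]


def select_answer(question: str, cot: Solver, pal: Solver, selector: Selector) -> str:
    """Run both reasoning methods and let a selector choose between them."""
    cot_reasoning, cot_answer = cot(question)
    pal_reasoning, pal_answer = pal(question)
    # Assumption of this sketch: no selection is needed when the answers agree.
    if cot_answer == pal_answer:
        return cot_answer
    choice = selector(question, cot_reasoning, pal_reasoning)
    return cot_answer if choice == "cot" else pal_answer


# Stub components standing in for LLM calls (illustrative only).
def fake_cot(question: str) -> Tuple[str, str]:
    return ("3 apples plus 4 apples is 8 apples.", "8")


def fake_pal(question: str) -> Tuple[str, str]:
    return ("answer = 3 + 4", "7")


def fake_selector(question: str, cot_reasoning: str, pal_reasoning: str) -> str:
    return "pal"


if __name__ == "__main__":
    print(select_answer("What is 3 + 4?", fake_cot, fake_pal, fake_selector))
```

In a real setting, each stub would be replaced by a prompted call to the underlying model, with the selector prompt showing both reasoning traces side by side.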

James Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, Michael Qizhe Xie • 2023

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K (test)	Accuracy: 90.1	954
Mathematical Reasoning	GSM8K (test)	Accuracy: 80.8	816
Mathematical Reasoning	SVAMP (test)	Accuracy: 93.7	293
Arithmetic Reasoning	MultiArith	Accuracy: 99.7	293
Arithmetic Reasoning	GSM8K	Accuracy: 95.6	272
Arithmetic Reasoning	GSM8K (test)	Accuracy: 96.8	189
Arithmetic Reasoning	ADDSUB	Accuracy: 95.7	149
Arithmetic Reasoning	MultiArith (test)	Accuracy: 99	115
Mathematical Reasoning	CollegeMath (test)	Accuracy: 46.7	94
Mathematical Reasoning	MAWPS (test)	Accuracy: 95.3	87

Showing 10 of 21 rows
