Evaluating the Text-to-SQL Capabilities of Large Language Models

About

We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.

Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau • 2022
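The abstract's few-shot setting, where a small number of in-domain examples are placed in the prompt, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `build_prompt` helper, the comment-style prompt layout, the toy `state` schema, and the example question/SQL pairs are hypothetical and are not taken from the paper or from the GeoQuery or Scholar datasets.

```python
# Hypothetical sketch: assemble a few-shot Text-to-SQL prompt.
# The layout (schema, then question/SQL pairs, then the new question)
# is an assumed format, not the one used by the authors.

# Toy schema, invented for illustration only.
SCHEMA = """
CREATE TABLE state (
    state_name TEXT,
    population INTEGER,
    area REAL
);
"""

# Invented in-domain examples: (natural-language question, SQL query).
EXAMPLES = [
    ("What is the population of Texas?",
     "SELECT population FROM state WHERE state_name = 'texas';"),
    ("Which state has the largest area?",
     "SELECT state_name FROM state ORDER BY area DESC LIMIT 1;"),
]


def build_prompt(schema_sql, examples, question):
    """Return a prompt string ending in 'SELECT' for the model to complete."""
    parts = [schema_sql.strip(), ""]
    for ex_question, ex_sql in examples:
        # Each in-domain example is a commented question followed by its SQL.
        parts.append(f"-- Question: {ex_question}")
        parts.append(ex_sql.strip())
        parts.append("")
    # The new question is left open so the model writes the rest of the query.
    parts.append(f"-- Question: {question}")
    parts.append("SELECT")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(SCHEMA, EXAMPLES, "How many states are there?"))
```

The resulting string would be sent to a code-completion model, and the returned text appended to the trailing `SELECT` to form the predicted query. Ending the prompt with `SELECT` is a design choice in this sketch that nudges the model toward producing SQL rather than free text.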

Related benchmarks

Task	Dataset	Result
Table Question Answering	WikiTQ	Accuracy 52.9	149
Text-to-SQL	Spider (dev)	--	147
Table Fact Verification	TabFact (test)	Accuracy 69.7	146
Table Question Answering	WikiTQ (test)	Accuracy 61.1	140
Table Question Answering	WikiTableQuestions (test)	Accuracy 52.9	86
Fact Verification	TabFact	Accuracy 68.37	83
Table-based Fact Verification	TabFact	Accuracy 64.71	49
Table Question Answering	STQA-N	Accuracy 62.6	20
Table Question Answering	STQA L	Accuracy 47.1	20
