Evaluating the Text-to-SQL Capabilities of Large Language Models
About
We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.
Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau • 2022
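The few-shot setting mentioned in the abstract, where a small number of in-domain examples are placed in the prompt, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `build_few_shot_prompt`, the `-- Question:` comment layout, the toy schema, and the example pair are hypothetical and are not taken from the paper's actual prompts. The sketch only assembles the prompt text and does not call any model.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    schema: str,
    examples: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble a prompt: schema, worked question/SQL pairs, then the new question."""
    parts = [schema.strip(), ""]
    for example_question, example_sql in examples:
        # Each in-domain example is a question (as a SQL comment) followed by its query.
        parts.append(f"-- Question: {example_question}")
        parts.append(example_sql.strip())
        parts.append("")
    # End with the new question and the start of a query for the model to complete.
    parts.append(f"-- Question: {question}")
    parts.append("SELECT")
    return "\n".join(parts)


# Hypothetical schema and example pair, for illustration only.
schema = "CREATE TABLE state (state_name TEXT, population INTEGER, area REAL);"
examples = [
    ("How many states are there?", "SELECT COUNT(*) FROM state;"),
]
prompt = build_few_shot_prompt(schema, examples, "Which state has the largest area?")
print(prompt)
```

Ending the prompt with `SELECT` is one possible way to nudge a code-completion model toward emitting a query; other layouts are equally plausible.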
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Table Question Answering | WikiTQ | Accuracy | 52.9 | 149 |
| Text-to-SQL | Spider (dev) | -- | -- | 147 |
| Table Fact Verification | TabFact (test) | Accuracy | 69.7 | 146 |
| Table Question Answering | WikiTQ (test) | Accuracy | 61.1 | 140 |
| Table Question Answering | WikiTableQuestions (test) | Accuracy | 52.9 | 86 |
| Fact Verification | TabFact | Accuracy | 68.37 | 83 |
| Table-based Fact Verification | TabFact | Accuracy | 64.71 | 49 |
| Table Question Answering | STQA-N | Accuracy | 62.6 | 20 |
| Table Question Answering | STQA L | Accuracy | 47.1 | 20 |