Evaluating the Text-to-SQL Capabilities of Large Language Models
About
We perform an empirical evaluation of the Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze its failure modes in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to outperform state-of-the-art models finetuned on such few-shot examples.
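The few-shot setting described above amounts to prepending the database schema and a handful of in-domain question/query pairs to the prompt before the target question. A minimal sketch of this prompt construction follows; the helper name, schema, and example pairs are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of few-shot prompt construction for Text-to-SQL:
# schema as CREATE TABLE statements, then question/query example
# pairs, then the target question. All names here are illustrative.

def build_prompt(schema_ddl, examples, question):
    """Assemble a few-shot Text-to-SQL prompt for a code LM."""
    parts = [schema_ddl.strip(), ""]
    for q, sql in examples:
        parts.append(f"-- Question: {q}")
        parts.append(sql.strip())
        parts.append("")
    parts.append(f"-- Question: {question}")
    parts.append("SELECT")  # nudge the model to begin a SQL query
    return "\n".join(parts)

# Hypothetical GeoQuery-style schema and in-domain example.
schema = "CREATE TABLE city (city_name TEXT, population INT, state_name TEXT);"
shots = [(
    "How many cities are in Texas?",
    "SELECT COUNT(*) FROM city WHERE state_name = 'texas';",
)]
prompt = build_prompt(schema, shots, "What is the population of Austin?")
print(prompt)
```

In the zero-shot Spider setting, the `examples` list would simply be empty and only the schema precedes the target question.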
Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau • 2022
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-SQL | Spider (dev) | -- | 100 |
| Table Question Answering | WikiTQ (test) | Accuracy: 61.1 | 92 |
| Table Question Answering | WikiTableQuestions (test) | Accuracy: 52.9 | 86 |
| Fact Verification | TabFact | Accuracy: 68.37 | 73 |
| Table Question Answering | WikiTQ | Accuracy: 52.9 | 65 |
| Table-based Fact Verification | TabFact | Accuracy: 64.71 | 33 |