
SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

About

Recent advances in Text-to-SQL have largely focused on the SQLite dialect, neglecting the diverse landscape of SQL dialects like BigQuery and PostgreSQL. This limitation is due to the diversity in SQL syntaxes and functions, along with the high cost of collecting and curating SQL-specific training data. To address this, we introduce SQL-GEN, a framework for generating high-quality synthetic training data for any SQL dialect, guided by readily available dialect-specific tutorials. SQL-GEN significantly improves cross-dialect Text-to-SQL performance, boosting execution accuracy by up to 20% over existing methods. This performance gain narrows the gap with models trained on large-scale human-annotated data. Furthermore, combining synthetic data from SQL-GEN with human-annotated data yields additional improvements of up to 5.6%. To unify multi-dialect capabilities within a single model, we propose a novel Mixture-of-Experts (MoE) initialization that leverages the shared knowledge across dialects. Our approach merges self-attention layers from dialect-specific models and initializes expert gates using dialect-specific keywords. This leads to a versatile model optimized for multiple SQL dialects, outperforming single-dialect models and significantly enhancing overall performance.
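Execution accuracy, the metric quoted in the abstract, is commonly computed by running the predicted and the reference query against the same database and comparing the returned rows. The sketch below illustrates that idea on an in-memory SQLite database. It is a simplified illustration, not the paper's evaluation code: the function names, the toy `singer` table, and the order-insensitive row comparison are all assumptions, and official benchmark scripts may compare results differently.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows with mixed or NULL values can still be ordered.
    return sorted(rows, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose results match (hypothetical helper)."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        pred = run_query(conn, predicted_sql)
        if gold is not None and pred == gold:
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Toy database, assumed purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 24), (3, 'Cy', 45);
""")

pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT name FROM singer WHERE age > 30", "SELECT name FROM singer WHERE age >= 31"),
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(id) FROM singer"),
    # Invalid column name: the prediction fails to execute.
    ("SELECT nme FROM singer", "SELECT name FROM singer"),
    # Executes, but returns different rows.
    ("SELECT name FROM singer WHERE age < 30", "SELECT name FROM singer WHERE age > 30"),
]

score = execution_accuracy(conn, pairs)
print(score)
```

Comparing results rather than query strings is what lets two differently written queries both count as correct, which matters when the same question is answered in different SQL dialects.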

Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-SQL | BIRD (dev) | Execution Accuracy (EA) | 57.92 | 387 |
| Text-to-SQL | Spider (test) | Execution Accuracy | 85.32 | 213 |
| Text-to-SQL | Spider (dev) | EX | 77.56 | 147 |
| Text-to-SQL | Spider-DK | Execution Accuracy (EX) | 73.9 | 95 |
| Text-to-SQL | Spider-Syn | Execution Accuracy (EX) | 67.8 | 79 |
| Text-to-SQL | EHRSQL | Execution Accuracy | 34.51 | 61 |
| Text-to-SQL | Spider-Realistic | Execution Accuracy (EX) | 70.35 | 39 |
| Text-to-SQL | Science Benchmark | Execution Accuracy | 46.88 | 28 |
| Text-to-SQL Data Synthesis | BIRD Few Columns (train) | Token Cost (1k) | 581.4 | 3 |
| Text-to-SQL Data Synthesis | BIRD Medium Columns (train) | Token Cost (1k Tokens) | 1.32e+3 | 3 |
Showing 10 of 11 rows
