Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

About

Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated <question, SQL> pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that is not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs a task-specific knowledge base, including schema semantics, abbreviations, business logic, and query patterns, and injects this knowledge into both training and inference. The framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models on Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.

Tianhao Qiu, Xiaojun Chen • 2026
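
The abstract describes injecting retrieved knowledge (schema semantics, abbreviations, business logic, query patterns) into inference. The following is a minimal sketch of that idea only, not the authors' implementation: the `KnowledgeEntry` type, the keyword-overlap scoring, the stopword list, and the prompt layout are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list; a real system would likely use embeddings instead.
STOPWORDS = {"the", "is", "of", "a", "an", "in", "what", "to", "and", "for"}


@dataclass
class KnowledgeEntry:
    kind: str  # e.g. "schema", "abbreviation", "business_logic", "query_pattern"
    text: str


def tokenize(s: str) -> set:
    """Lowercase word tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9_]+", s.lower())) - STOPWORDS


def retrieve(question: str, kb: list, top_k: int = 2) -> list:
    """Return up to top_k entries sharing the most tokens with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(e.text)), i, e) for i, e in enumerate(kb)]
    scored = [t for t in scored if t[0] > 0]      # drop entries with no overlap
    scored.sort(key=lambda t: (-t[0], t[1]))      # best score first, stable on ties
    return [e for _, _, e in scored[:top_k]]


def build_prompt(question: str, schema: str, kb: list, top_k: int = 2) -> str:
    """Assemble a prompt that places retrieved knowledge next to the schema."""
    lines = ["-- Schema", schema, "-- Relevant knowledge"]
    lines += [f"[{e.kind}] {e.text}" for e in retrieve(question, kb, top_k)]
    lines += ["-- Question", question, "-- SQL"]
    return "\n".join(lines)


if __name__ == "__main__":
    kb = [
        KnowledgeEntry("abbreviation", "GMV means gross merchandise value, stored in orders.total_amt"),
        KnowledgeEntry("business_logic", "Active customers are those with an order in the last 30 days"),
    ]
    schema = "orders(id, customer_id, total_amt, created_at)"
    print(build_prompt("What is the GMV of active customers?", schema, kb))
```

The resulting prompt string would then be passed to whichever language model generates the SQL; that call is left out because the listing does not specify it.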

Related benchmarks

Task	Dataset	Metric	Result
Text-to-SQL	BIRD (dev)	Execution Accuracy (EA)	67.8	477
Text-to-SQL	Spider (test)	Execution Accuracy	88.68	256
Text-to-SQL	Spider (dev)	EX	82.34	196
Text-to-SQL	Spider-DK	Execution Accuracy (EX)	80.97	136
Text-to-SQL	Spider-Syn	Execution Accuracy (EX)	74.83	120
Text-to-SQL	EHRSQL	Execution Accuracy	55.46	69
Text-to-SQL	Spider-Realistic	Execution Accuracy (EX)	80.11	55
Text-to-SQL	Science Benchmark	Execution Accuracy	59.53	48
Text-to-SQL Data Synthesis	BIRD Few Columns (train)	Token Cost (1k)	1.28e+3	3
Text-to-SQL Data Synthesis	BIRD Medium Columns (train)	Token Cost (1k Tokens)	2.17e+3	3

Showing 10 of 11 rows
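
Most rows above report execution accuracy, which is generally computed by running the predicted and the reference SQL against the same database and counting a hit when the results match. The sketch below illustrates that computation with SQLite. It assumes an order-insensitive multiset comparison of rows and treats a failing predicted query as a miss; the official evaluation scripts of individual benchmarks may compare results differently, and the function names here are hypothetical.

```python
import sqlite3
from collections import Counter


def run(conn: sqlite3.Connection, sql: str):
    """Execute a query and return its rows as a multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn: sqlite3.Connection, pairs: list) -> float:
    """Percentage of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run(conn, gold_sql)
        predicted_rows = run(conn, predicted_sql)
        # A pair counts only if the gold query runs and both results agree.
        if gold_rows is not None and predicted_rows == gold_rows:
            hits += 1
    return 100.0 * hits / len(pairs)
```

Because rows are compared as a multiset, a prediction that returns the correct rows in a different order still counts as correct; a stricter check would be needed for questions where ordering matters.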
