Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL

About

Large Language Models (LLMs) can translate natural language into SQL, but small models struggle with multi-table and complex queries in Zero-Shot Learning (ZSL) settings. While Supervised Fine-Tuning (SFT) helps, it falls short for harder cases. To address this, we study how different reasoning strategies (general-purpose reasoning in ZSL, reasoning traces in SFT, and Reinforcement Learning with Verifiable Reward (RLVR) with novel reward functions) affect Text2SQL performance across four benchmarks. We show that partial scoring rewards, computed via SQL execution, are crucial for guiding models even when outputs are not fully correct. These fine-grained signals lead to consistently better Text2SQL outcomes. Small LLMs benefit most from reasoning-aware SFT and RL, with the 14B Qwen-Coder-2.5 surpassing 400B+ models on challenging datasets like BIRD.
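To make the idea of a partial scoring reward computed via SQL execution concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is not the reward function from the paper: the row-overlap (Jaccard) score, the function names `run_query`, `partial_execution_reward`, and `make_demo_db`, and the toy `singers` table are all assumptions chosen for illustration. The point is only that a prediction returning some of the correct rows can earn a reward between 0 and 1 instead of a flat 0.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, or None if it fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def partial_execution_reward(conn, predicted_sql, gold_sql):
    """Hypothetical partial reward: multiset Jaccard overlap of result rows.

    Returns 1.0 for identical result sets, 0.0 for a failing or disjoint
    prediction, and a value in between for partially correct results.
    """
    gold = run_query(conn, gold_sql)
    if gold is None:
        raise ValueError("gold query failed to execute")
    pred = run_query(conn, predicted_sql)
    if pred is None:
        return 0.0  # assumption: non-executable SQL earns no reward
    if not gold and not pred:
        return 1.0  # both queries return empty results
    overlap = sum((gold & pred).values())  # rows present in both results
    union = sum((gold | pred).values())    # rows present in either result
    return overlap / union


def make_demo_db():
    """Build a tiny in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singers (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singers VALUES (?, ?)",
        [("Ann", 30), ("Bob", 45), ("Cy", 52)],
    )
    return conn
```

With the demo database, a gold query `SELECT name FROM singers WHERE age > 40` and a prediction that filters on `age > 50` share one of two distinct rows, so the prediction receives partial credit, whereas a plain execution-accuracy check would score it as wrong.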

Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti • 2025

Related benchmarks

Task	Dataset	Metric	Result
Text-to-SQL	BIRD (dev)	Execution Accuracy (EA)	61.3	387
Text-to-SQL	Spider (test)	Execution Accuracy	85.6	213
Text-to-SQL	Spider-DK	Execution Accuracy (EX)	77.8	95
Text-to-SQL	Spider-Syn	Execution Accuracy (EX)	78.6	79
Text-to-SQL	EHRSQL	Execution Accuracy	38.9	61
Text-to-SQL	Spider-Realistic	Execution Accuracy (EX)	81.8	47
