Table-R1: Inference-Time Scaling for Table Reasoning

About

In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1 while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as the emergence of essential table reasoning skills during RL training.
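
To make the RLVR idea concrete, here is a minimal sketch of what a verifiable reward for short-form table QA and a GRPO-style group-relative advantage could look like. It is an illustration under stated assumptions, not the authors' implementation: the `<answer>...</answer>` output format, the `normalize` rule, and the function names `exact_match_reward` and `group_relative_advantages` are all hypothetical, and the abstract does not specify the actual reward functions.

```python
import re
from statistics import mean, pstdev

# Hypothetical output format: the model is assumed to wrap its final
# answer in <answer>...</answer> tags. The source does not specify this.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def exact_match_reward(completion, gold):
    """Return 1.0 if the extracted answer matches the gold answer, else 0.0.

    A completion without an answer tag gets 0.0, so the reward can be
    checked automatically without a learned reward model.
    """
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if normalize(match.group(1)) == normalize(gold) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    GRPO scores each sampled completion relative to the other samples
    drawn for the same prompt. Using the population standard deviation
    and this eps value is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one question whose gold answer is "Paris".
completions = [
    "The table lists the capital. <answer>Paris</answer>",
    "<answer>London</answer>",
    "I think it is Paris.",  # no answer tag, so no reward
    "<answer> paris. </answer>",
]
rewards = [exact_match_reward(c, "Paris") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct completions receive positive advantages and incorrect ones negative advantages, and a group in which every sample earns the same reward yields all-zero advantages. Fact verification and free-form QA would need different checks (for example, label matching or a text-overlap score), which is presumably what "task-specific" refers to in the abstract.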

Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao • 2025

Related benchmarks

Task	Dataset	Result
Table Fact Verification	TabFact (test)	Accuracy 87.17	146
Table Question Answering	WikiTQ (test)	Accuracy 81.7	140
Text-to-SQL	Spider	--	139
Structure Comprehending	RealHitBench	Exact Match (EM) 28.5	94
Fact Checking	RealHitBench	Exact Match 0.00e+0	94
Text-to-SQL	Bird	Total Execution Accuracy 50.98	68
Numerical Reasoning	RealHitBench	Exact Match (EM) 0.00e+0	66
Chart Generation	RealHitBench	ECR 16	60
Data Analysis	RealHitBench	GPT Score 36.24	60
Financial Question Answering	FinQA (test)	Accuracy 41.27	57

Showing 10 of 20 rows
