TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

About

Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.

Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li• 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval	IFEval Accuracy28.53	836
Table Question Answering	WikiTQ	Accuracy73.02	149
Text-to-SQL	Spider	--	139
Table Question Answering	HiTab	Accuracy29.74	121
Table Question Answering	WTQ	Accuracy12.31	101
Table Question Answering	TabMWP	Accuracy18.5	97
Structure Comprehending	RealHitBench	Exact Match (EM)53.28	94
Fact Checking	RealHitBench	Exact Match33.53	94
Text-to-SQL	Bird	Total Execution Accuracy30.64	68
Numerical Reasoning	RealHitBench	Exact Match (EM)13.36	66

Showing 10 of 35 rows

Other info

Follow for update

@wizwand_team Discord