Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

About

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how can LVLMs be adapted to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, generalizing particularly well to unseen table structures. Our data and code are available at https://github.com/AAAndy-Zhu/TableVLM.

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang • 2026

Related benchmarks

Task	Dataset	Result
Text-based Visual Question Answering	TextVQA	Accuracy 80.83	962
Science Question Answering	ScienceQA	Accuracy 95.09	791
Table Fact Verification	TabFact (test)	Accuracy 75.41	146
Visual Hallucination Evaluation	HallusionBench	Accuracy 74.97	120
Hallucination Evaluation	CRPE relation	Accuracy 77.92	23
Table Structure Detection	MMTab In-domain	Row Score 64.2	19
Table Question Answering	TAT-QA (test)	Accuracy 40.54	15
Question Answering	WTQ (test)	Accuracy 57.11	11
Fact Verification	InfoTabs (test)	Accuracy 72.67	11
Question Answering	HiTab (test)	Accuracy 35.47	11

Showing 10 of 16 rows
