Probing How Scalable Table Data Enhances General Long-Context Reasoning
About
As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
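To make the phrase "periodic non-vanishing dependencies" concrete, here is a minimal Python sketch on a toy model. It is not the paper's analysis or data. It assumes a hypothetical generator (`make_table`) in which every column has a random dominant value, flattens each table row by row, and estimates the mutual information between tokens a fixed number of positions apart. The names `make_table`, `lagged_mi`, and `p_dominant`, and all parameter values, are illustrative assumptions.

```python
import math
import random
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # Sum of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) over observed pairs.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def make_table(rng, n_rows, n_cols, vocab_size, p_dominant):
    """Toy table flattened row by row (hypothetical generator, an assumption).

    Each column gets a random dominant value; a cell takes that value with
    probability p_dominant and is otherwise drawn uniformly from the vocabulary.
    """
    dominant = [rng.randrange(vocab_size) for _ in range(n_cols)]
    return [
        dominant[c] if rng.random() < p_dominant else rng.randrange(vocab_size)
        for _ in range(n_rows)
        for c in range(n_cols)
    ]


def lagged_mi(tables, lag):
    """Pool token pairs that sit `lag` positions apart inside each table."""
    pairs = [(t[i], t[i + lag]) for t in tables for i in range(len(t) - lag)]
    return mutual_information(pairs)


N_COLS = 5  # row width of the toy tables
rng = random.Random(0)
tables = [make_table(rng, 20, N_COLS, 4, 0.8) for _ in range(400)]
mi_by_lag = {lag: lagged_mi(tables, lag) for lag in range(1, 3 * N_COLS + 1)}

for lag, mi in mi_by_lag.items():
    marker = "  <- multiple of row width" if lag % N_COLS == 0 else ""
    print(f"lag {lag:2d}: {mi:.3f} bits{marker}")
```

In this toy setting the estimate stays well above zero at lags that are multiples of the row width, because both tokens then come from the same column, and it is close to zero at every other lag. The peaks do not shrink as the lag grows, which is the sense in which the dependence is periodic and non-vanishing here. Real tables have richer structure, so this only illustrates the shape of the effect.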
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Reasoning | LiveCodeBench | Accuracy | 58.68 | 62 |
| Long-context Reasoning | LongBench v2 | -- | -- | 48 |
| Long-context retrieval and synthetic reasoning | RULER | Accuracy | 80.72 | 47 |
| Science Reasoning | GPQA Diamond | Accuracy | 63.64 | 34 |
| Long-context Understanding | MRCR | Accuracy | 42.66 | 15 |
| Long-context Reasoning | BrowsCompLong | Accuracy | 74.31 | 11 |
| Long-Context Mathematical Reasoning | GSM-Infinite | Accuracy | 23.4 | 11 |
| Long-context Reasoning | Loong | Accuracy | 45.3 | 11 |
| Long-context Reasoning | Oolong-Synth | Accuracy | 51.41 | 11 |
| Multi-turn Dialogue Reasoning | MultiChallenge | Accuracy | 32.97 | 4 |