GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning
About
Large Multimodal Models (LMMs) often struggle with geometric reasoning because of visual hallucinations and a shortage of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. Using a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and feeds them into a rendering pipeline that produces high-precision geometric diagrams. With this engine, we construct GeoSym127K, a difficulty-stratified dataset of 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we show that GeoSym yields improvements concentrated on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (an improvement of +6.19%) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models such as Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO shows that initializing from structural SFT checkpoints substantially raises the performance ceiling compared with zero-shot RL. Because these RL gains are driven by deterministic exact-match reward signals, they illustrate the scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.
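The abstract says the RLVR stage is driven by deterministic exact-match signals and uses GRPO. As a rough illustration only, the sketch below shows what a binary exact-match reward and GRPO's group-relative advantage (each reward standardized against the other samples drawn for the same question) can look like. The answer normalization (exact rationals via `fractions.Fraction`, otherwise a whitespace- and case-insensitive string match) and the function names `parse_exact`, `exact_match_reward`, and `group_relative_advantages` are hypothetical assumptions, not taken from the GeoSym127K code release.

```python
from fractions import Fraction
from statistics import mean, pstdev
from typing import List, Optional


def parse_exact(answer: str) -> Optional[Fraction]:
    """Hypothetical helper: parse an integer, decimal, or 'a/b' string into an exact rational, else None."""
    text = "".join(answer.split())  # drop all whitespace, e.g. "3 / 4" -> "3/4"
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def _normalize_text(answer: str) -> str:
    """Fallback normalization for non-numeric answers: remove whitespace and lowercase."""
    return "".join(answer.split()).lower()


def exact_match_reward(prediction: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 on an exact match, else 0.0 (assumed matching rule).

    Numeric answers are compared as exact rationals, so "0.75" matches "3/4".
    Anything else (e.g. a radical such as "2*sqrt(3)") falls back to a
    whitespace- and case-insensitive string comparison.
    """
    pred_num, gt_num = parse_exact(prediction), parse_exact(ground_truth)
    if pred_num is not None and gt_num is not None:
        return 1.0 if pred_num == gt_num else 0.0
    return 1.0 if _normalize_text(prediction) == _normalize_text(ground_truth) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: (r_i - mean(r)) / (std(r) + eps) within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    ground_truth = "3/4"
    samples = ["0.75", "3 / 4", "1/2", "sqrt(2)/2"]  # four sampled answers to one question
    rewards = [exact_match_reward(s, ground_truth) for s in samples]
    print(rewards)  # [1.0, 1.0, 0.0, 0.0]
    print([round(a, 3) for a in group_relative_advantages(rewards)])  # [1.0, 1.0, -1.0, -1.0]
```

One consequence of this formulation: if every sample in a group is right (or every sample is wrong), all advantages are zero and that group contributes no gradient signal. This is one reason a reasonably strong starting checkpoint can matter for GRPO.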
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Mathematical Reasoning | MathVista 14 (1000) | Macro Score | 76.6 | 22 |
| Multimodal Mathematical Reasoning | Aggregate Math Benchmarks | Overall Macro Score | 63.18 | 6 |
| Geometric Reasoning | GeoSym-Bench (test) | Accuracy (%) | 18.79 | 4 |
| Multimodal Mathematical Reasoning | MathVerse Vision-only 35 | Macro Avg Score | 60.53 | 4 |
| Multimodal Mathematical Reasoning | MathVision 28 (3040) | Macro-average Score | 54.21 | 2 |
| Multimodal Mathematical Reasoning | WeMath 19 | Macro Average Score | 61.52 | 2 |