GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
About
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, which limits their geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through TikZ-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-TikZ dataset to date with 2.5M pairs (16$\times$ larger than existing open-source datasets). The dataset is constructed via iterative data expansion and a localized geometric transformation strategy. GeoTikzBridge-Instruct is then fine-tuned on the GeoTikz-Instruct dataset, the first instruction-augmented TikZ dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-source MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM or LLM, enhancing reasoning performance in geometric problem-solving. Datasets and code are publicly available at: https://github.com/sjy-1995/GeoTikzBridge.
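To make the idea of an image-to-TikZ pair and a local geometric edit more concrete, the following is a minimal sketch and not the authors' pipeline. The abstract does not specify how the localized geometric transformation works, so the sketch assumes it can be pictured as moving a single coordinate in the TikZ source while leaving the rest of the figure untouched. The triangle source, the `COORD` pattern, and the `shift_point` helper are all hypothetical names introduced for illustration.

```python
import re

# Hypothetical TikZ source for a triangle; not taken from the GeoTikz datasets.
TIKZ_TRIANGLE = r"""\begin{tikzpicture}
  \draw (0,0) -- (4,0) -- (1,3) -- cycle;
\end{tikzpicture}"""

# Matches a numeric coordinate pair such as (1,3) or (-0.5, 2.25).
COORD = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def shift_point(tikz: str, index: int, dx: float, dy: float) -> str:
    """Return a copy of `tikz` with the index-th coordinate pair moved by (dx, dy).

    Assumption: a "local" edit changes one point only; every other
    coordinate and command in the source is preserved verbatim.
    """
    matches = list(COORD.finditer(tikz))
    if not 0 <= index < len(matches):
        raise IndexError(f"no coordinate pair at index {index}")
    m = matches[index]
    x = float(m.group(1)) + dx
    y = float(m.group(2)) + dy
    moved = f"({x:g},{y:g})"
    return tikz[:m.start()] + moved + tikz[m.end():]


# Move the apex of the triangle; the base vertices stay where they are.
variant = shift_point(TIKZ_TRIANGLE, 2, 0.5, -1.0)
print(variant)
```

Compiling the original and the edited source would yield two slightly different figures, each paired with its own code. Whether GeoTikz-Base is expanded in this way is an assumption; the sketch only shows why code-level edits give precise control over local geometry.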
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista | Accuracy (All) | 88.9 | 43 |
| Math Reasoning | MathBench EN | Score | 49.8 | 32 |
| Mathematical Reasoning | GAOKAO-MM Math | Accuracy | 68.8 | 17 |
| Image-to-TikZ | DaTikZ | CLIP Score | 81.3 | 16 |
| Image-to-TikZ | MathVista GPS | CLIP Score | 91.5 | 16 |
| Mathematical Reasoning | RBench-V Math | Accuracy | 30.1 | 16 |
| Mathematical Reasoning | RBench-V (Overall) | Accuracy | 16.4 | 16 |
| Image-to-TikZ | GAOKAO-MM Math | CLIP Score | 90 | 9 |
| Image-to-TikZ | EDUBenchmark | CLIP Score | 82.1 | 9 |
| Instructed Code Generation | Instructed Code Generation | CLIP Score | 99.2 | 3 |