CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
About
Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification. Most open-source approaches, however, still rely on text-only chains, rigid visual schemas, or single-step pipelines, which limits their flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calling, which trades off exploration against efficiency and mitigates tool overuse. Beyond the capabilities taught by atomic supervision, we empirically observe emergent behaviors during RL training: CodeDance produces novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism for executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance consistently outperforms schema-driven and text-only baselines, and also surpasses advanced closed models such as GPT-4o as well as larger open-source models.
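To make the contrast with fixed-schema calls concrete, the following is a minimal sketch of what one executable reasoning step could look like: code that composes small tools, computes an intermediate result, and renders a visual artifact that can be checked. It is an illustration only, not CodeDance's actual tool set or sandbox. The helper names (`crop`, `count_bright`, `draw_box`, `reasoning_step`) and the list-of-lists image representation are assumptions chosen to keep the example dependency-free.

```python
from typing import List, Tuple

# Hypothetical representations, assumed for illustration only.
Image = List[List[int]]            # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Tool 1: return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def count_bright(image: Image, threshold: int = 128) -> int:
    """Tool 2: count pixels at or above the threshold (an intermediate result)."""
    return sum(1 for row in image for px in row if px >= threshold)


def draw_box(image: Image, box: Box, value: int = 255) -> Image:
    """Tool 3: return a copy of the image with the box outline drawn on it."""
    left, top, right, bottom = box
    out = [row[:] for row in image]  # copy so the input image is not modified
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


def reasoning_step(image: Image, box: Box) -> Tuple[int, Image]:
    """Compose the tools: inspect a region, compute a count, render the evidence."""
    region = crop(image, box)
    n_bright = count_bright(region)      # computed before any drawing happens
    annotated = draw_box(image, box)     # visual artifact for self-checking
    return n_bright, annotated


# Tiny 6x6 example image with three bright pixels, two of them inside the box.
example = [[0] * 6 for _ in range(6)]
example[2][2] = 200
example[3][3] = 250
example[0][5] = 255
count, annotated = reasoning_step(example, (1, 1, 5, 5))
print(count)
```

A fixed-schema call would return only the box coordinates. Here the same step also yields a number and an annotated image, so the intermediate result can be verified against the rendered region.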
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Chart Understanding | ChartQA | Accuracy 87.5 | 83 |
| Counting | CountBench | Accuracy 91.2 | 52 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy 70.3 | 46 |
| Multimodal Reasoning | WeMath | Accuracy 39.6 | 43 |
| Multimodal Math Reasoning | MathVision | Accuracy 29.6 | 31 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy 46.8 | 29 |
| Multimodal Math Reasoning | WeMath | Accuracy 39.6 | 26 |
| Visual Search | HR-Bench-8K | Accuracy 72.3 | 23 |
| Visual Search | HR-Bench-4K | Accuracy 75.2 | 23 |
| Multimodal Reasoning | MathVision | -- | 23 |