Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
About
Vision-as-inverse-graphics, the idea of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which lack the fine-grained spatial grounding needed in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework in which symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: it synthesizes symbolic programs, projects them into visual states, and inspects discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy over one-shot baselines, with relative gains of 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on our proposed BlenderBench.
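The code-render-inspect loop above can be sketched as follows. This is a minimal illustrative skeleton, not the actual VIGA implementation: the `synthesize`, `render`, and `inspect` functions are hypothetical stand-ins for the VLM's program proposal, the graphics engine's projection, and the VLM's visual critique, and the discrepancy metric is a dummy.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Evolving multimodal memory: current program plus render/critique history."""
    program: str
    history: list = field(default_factory=list)

def synthesize(state: AgentState, target: str) -> str:
    # Stand-in for the VLM proposing a symbolic program edit toward the target.
    return state.program + f"\n# edit toward: {target}"

def render(program: str) -> str:
    # Stand-in for projecting the program into a visual state (e.g. a Blender render).
    return f"<render of {len(program)}-char program>"

def inspect(image: str, target: str, step: int) -> float:
    # Stand-in for the VLM comparing the render against the target description;
    # here the discrepancy simply shrinks as edits accumulate.
    return 1.0 / (step + 1)

def viga_loop(target: str, max_steps: int = 4, tol: float = 0.3) -> AgentState:
    """Iterate code -> render -> inspect until the discrepancy falls below tol."""
    state = AgentState(program="# initial program")
    for step in range(max_steps):
        state.program = synthesize(state, target)    # code
        image = render(state.program)                # render
        discrepancy = inspect(image, target, step)   # inspect
        state.history.append((image, discrepancy))   # multimodal memory
        if discrepancy < tol:
            break
    return state
```

Each iteration appends its render and critique to the agent's memory, so later edits are grounded in the accumulated visual evidence rather than a single one-shot guess.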
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Graphic Editing | BlenderGym | PL (Blend Shape) | 13.51 | 18 |
| Camera Adjustment | BlenderBench | PL | 0.6082 | 10 |
| Multi-step Editing | BlenderBench | PL | 33.14 | 10 |
| Compositional Editing | BlenderBench | PL | 30.14 | 10 |
| 2D Slide Generation | SlideBench | Execution Score | 95 | 8 |
| Overall Evaluation | BlenderBench | Improvement | 159.2 | 8 |
| Task 1 | BlenderBench | PL | 60.82 | 8 |
| Task 2 | BlenderBench | PL | 33.14 | 8 |
| Task 3 | BlenderBench | PL | 8.98 | 8 |