
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

About

Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: synthesizing symbolic programs, projecting them into visual states, and inspecting discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy over one-shot baselines: by 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on our proposed BlenderBench.
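The code-render-inspect loop described above can be sketched as follows. This is a minimal, hypothetical illustration only: the function names (`synthesize_program`, `render`, `inspect`) and their stub bodies are assumptions for exposition, whereas the real agent would call a VLM to write symbolic programs and a renderer such as Blender to produce visual states.

```python
# Hypothetical sketch of a code-render-inspect agent loop.
# All three helpers are stand-in stubs, not VIGA's actual implementation.

def synthesize_program(goal, memory):
    # Stub: in the real system, a VLM writes a symbolic program,
    # conditioned on the goal and the evolving multimodal memory.
    return {"goal": goal, "edits": len(memory)}

def render(program):
    # Stub: project the symbolic program into a visual state.
    return program["edits"]

def inspect(rendered, target):
    # Stub: compare the rendered state against the target;
    # a nonzero value means a visible discrepancy remains.
    return target - rendered

def viga_loop(goal, target, max_steps=10):
    memory = []  # evolving memory of observed discrepancies
    program = None
    for _ in range(max_steps):
        program = synthesize_program(goal, memory)
        state = render(program)
        gap = inspect(state, target)
        if gap == 0:
            break  # visual evidence matches the target; stop editing
        memory.append(gap)  # record evidence to guide the next edit
    return program, memory
```

With these toy stubs, each iteration narrows the discrepancy by one step, mirroring how the agent converges through iterative, evidence-based edits rather than a single one-shot generation.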

Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng• 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| 3D Graphic Editing | BlenderGym | PL (Blend Shape): 13.51 | 18 |
| Camera Adjustment | BlenderBench | PL: 0.6082 | 10 |
| Multi-step Editing | BlenderBench | PL: 33.14 | 10 |
| Compositional Editing | BlenderBench | PL: 30.14 | 10 |
| 2D Slide Generation | SlideBench | Execution Score: 95 | 8 |
| Overall Evaluation | BlenderBench | Improvement: 159.2 | 8 |
| Task 1 | BlenderBench | PL: 60.82 | 8 |
| Task 2 | BlenderBench | PL: 33.14 | 8 |
| Task 3 | BlenderBench | PL: 8.98 | 8 |
