Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
About
Vision-as-inverse-graphics, the idea of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which lack the fine-grained spatial grounding needed in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework in which symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: it synthesizes symbolic programs, projects them into visual states, and inspects discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy over one-shot baselines, with relative gains of 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on our proposed BlenderBench.
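The code-render-inspect loop above can be sketched as follows. This is a minimal illustrative skeleton, not the actual VIGA implementation: the `synthesize`, `render`, and `inspect` functions are hypothetical stand-ins for the VLM's program proposal, the graphics engine's projection, and the VLM's visual critique, and the discrepancy metric is a dummy.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Evolving multimodal memory: current program plus render/critique history."""
    program: str
    history: list = field(default_factory=list)

def synthesize(state: AgentState, target: str) -> str:
    # Stand-in for the VLM proposing a symbolic program edit toward the target.
    return state.program + f"\n# edit toward: {target}"

def render(program: str) -> str:
    # Stand-in for projecting the program into a visual state (e.g. a Blender render).
    return f"<render of {len(program)}-char program>"

def inspect(image: str, target: str, step: int) -> float:
    # Stand-in for the VLM comparing the render against the target description;
    # here the discrepancy simply shrinks as edits accumulate.
    return 1.0 / (step + 1)

def viga_loop(target: str, max_steps: int = 4, tol: float = 0.3) -> AgentState:
    """Iterate code -> render -> inspect until the discrepancy falls below tol."""
    state = AgentState(program="# initial program")
    for step in range(max_steps):
        state.program = synthesize(state, target)    # code
        image = render(state.program)                # render
        discrepancy = inspect(image, target, step)   # inspect
        state.history.append((image, discrepancy))   # multimodal memory
        if discrepancy < tol:
            break
    return state
```

Each iteration appends its render and critique to the agent's memory, so later edits are grounded in the accumulated visual evidence rather than a single one-shot guess.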
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Graphic Editing | BlenderGym | PL (Blend Shape) | 13.51 | 18 |
| Camera Adjustment | BlenderBench | PL | 0.6082 | 10 |
| Multi-step Editing | BlenderBench | PL | 33.14 | 10 |
| Compositional Editing | BlenderBench | PL | 30.14 | 10 |
| 2D Slide Generation | SlideBench | Execution Score | 95 | 8 |
| Overall Evaluation | BlenderBench | Improvement | 159.2 | 8 |
| Task 1 | BlenderBench | PL | 60.82 | 8 |
| Task 2 | BlenderBench | PL | 33.14 | 8 |
| Task 3 | BlenderBench | PL | 8.98 | 8 |