IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition
About
Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose **Intrinsic Decomposition Transformer (IDT)**, a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
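
The abstract does not write out the composition equation, but the three factors suggest the standard additive model I = A ⊙ S_d + S_s, where A is diffuse reflectance (albedo), S_d is diffuse shading, and S_s is specular shading. Below is a minimal PyTorch sketch of this recomposition under that assumption; the tensor names and shapes are illustrative, not taken from the paper.

```python
import torch

def compose_image(albedo: torch.Tensor,
                  diffuse_shading: torch.Tensor,
                  specular_shading: torch.Tensor) -> torch.Tensor:
    """Recompose RGB images from intrinsic factors via I = A * S_d + S_s.

    A (albedo) and S_d (diffuse shading) multiply elementwise to give the
    Lambertian component; S_s (specular shading) adds the view-dependent,
    non-Lambertian component. All tensors are (V, 3, H, W) for V views.
    """
    return albedo * diffuse_shading + specular_shading

# Toy check on 4 random views (real factors would come from the model).
V, H, W = 4, 256, 256
albedo = torch.rand(V, 3, H, W)            # material color, view-invariant
diffuse_shading = torch.rand(V, 3, H, W)   # Lambertian light transport
specular_shading = torch.rand(V, 3, H, W)  # view-dependent highlights
image = compose_image(albedo, diffuse_shading, specular_shading)
assert image.shape == (V, 3, H, W)
```

Keeping the specular term additive rather than multiplicative is what lets the factorization isolate highlights without contaminating the albedo.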
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Depth Estimation | Hypersim (test) | Delta1: 43.3 | 17 |
| Intrinsic Decomposition | Hypersim | Albedo PSNR: 22.85 | 3 |
| Surface Normal Estimation | Hypersim (test) | Mean Angular Error: 14.1 | 2 |