IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition
About
Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose **Intrinsic Decomposition Transformer (IDT)**, a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
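
The abstract does not write out the composition equation, but the three factors suggest the standard additive model I = A ⊙ S_d + S_s, where A is diffuse reflectance (albedo), S_d is diffuse shading, and S_s is specular shading. Below is a minimal PyTorch sketch of this recomposition under that assumption; the tensor names and shapes are illustrative, not taken from the paper.

```python
import torch

def compose_image(albedo: torch.Tensor,
                  diffuse_shading: torch.Tensor,
                  specular_shading: torch.Tensor) -> torch.Tensor:
    """Recompose RGB images from intrinsic factors via I = A * S_d + S_s.

    A (albedo) and S_d (diffuse shading) multiply elementwise to give the
    Lambertian component; S_s (specular shading) adds the view-dependent,
    non-Lambertian component. All tensors are (V, 3, H, W) for V views.
    """
    return albedo * diffuse_shading + specular_shading

# Toy check on 4 random views (real factors would come from the model).
V, H, W = 4, 256, 256
albedo = torch.rand(V, 3, H, W)            # material color, view-invariant
diffuse_shading = torch.rand(V, 3, H, W)   # Lambertian light transport
specular_shading = torch.rand(V, 3, H, W)  # view-dependent highlights
image = compose_image(albedo, diffuse_shading, specular_shading)
assert image.shape == (V, 3, H, W)
```

Keeping the specular term additive rather than multiplicative is what lets the factorization isolate highlights without contaminating the albedo.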
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Depth Estimation | Hypersim (test) | Delta1: 43.3 | 17 |
| Intrinsic Decomposition | Hypersim | Albedo PSNR: 22.85 | 3 |
| Surface Normal Estimation | Hypersim (test) | Mean Angular Error: 14.1 | 2 |