LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
About
This paper introduces LeftRefill, an approach that efficiently harnesses large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches the reference and target views together into a single input: the reference image occupies the left side, and the target canvas is positioned on the right. LeftRefill then paints the right-side target canvas based on the left-side reference and specific task instructions. This task formulation resembles contextual inpainting and is akin to the way a human painter works from a reference. The formulation efficiently learns both structural and textural correspondence between reference and target without additional image encoders or adapters. We inject task and view information through the cross-attention modules of the T2I model, and further enable multi-view references via re-arranged self-attention modules. Together, these allow LeftRefill to perform consistent generation as a generalized model without test-time fine-tuning or model modifications. LeftRefill can thus be seen as a simple yet unified framework for reference-guided synthesis. As an exemplar, we apply LeftRefill to two different challenges, reference-guided inpainting and novel view synthesis, based on the pre-trained StableDiffusion. Code and models are released at https://github.com/ewrfcas/LeftRefill.
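The left-reference, right-canvas layout described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stitch_left_right`, its parameters, and the convention that a mask value of 1 marks pixels to be painted are all assumptions made for illustration, and images are assumed to be float arrays in [0, 1] with shape (H, W, C).

```python
import numpy as np


def stitch_left_right(reference, target=None, target_mask=None):
    """Build a side-by-side input: reference on the left, canvas on the right.

    Hypothetical helper for illustration only (not from the LeftRefill code).
    reference:   (H, W, C) float array, the left-side reference view.
    target:      (H, W, C) float array or None. None means an empty canvas,
                 as in novel view synthesis.
    target_mask: (H, W) array with 1 where the target must be painted, or
                 None to paint the whole right side.
    Returns the masked stitched image (H, 2W, C) and its mask (H, 2W).
    """
    h, w, _ = reference.shape
    if target is None:
        target = np.zeros_like(reference)
    if target.shape != reference.shape:
        raise ValueError("reference and target must have the same shape")
    if target_mask is None:
        target_mask = np.ones((h, w), dtype=np.float32)

    # Stitch along the width axis so the reference occupies the left half.
    canvas = np.concatenate([reference, target], axis=1)

    # The left half is never repainted, so its mask is all zeros.
    left_mask = np.zeros((h, w), dtype=np.float32)
    mask = np.concatenate([left_mask, target_mask.astype(np.float32)], axis=1)

    # Blank out the pixels that an inpainting model would be asked to fill.
    masked_canvas = canvas * (1.0 - mask)[..., None]
    return masked_canvas, mask


# Example: a fully masked right side (novel-view-synthesis style input).
reference = np.full((4, 6, 3), 0.5, dtype=np.float32)
stitched, stitched_mask = stitch_left_right(reference)
```

Under these assumptions, reference-guided inpainting would pass a partial `target_mask` covering only the hole in the target view, while novel view synthesis would leave the whole right half masked. The stitched image and mask would then be fed to an inpainting-style diffusion model.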
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Ref-inpainting | MegaDepth (test) | PSNR 21.779 | 12 |
| Novel View Synthesis | Objaverse 1.0 (val) | PSNR 24.685 | 7 |
| Object Removal | SPInNeRF 51 (test) | PSNR 30.29 | 6 |
| Inpainting | Scannet++ + Real10K + DL3DV 89, 103, 41 (unseen) | PSNR 15.14 | 6 |
| Novel View Synthesis | Google Scanned Objects (GSO) (out-of-distribution) | PSNR 23.169 | 5 |
| 4-view Novel View Synthesis | Objaverse (test) | PSNR 21.573 | 4 |
| Object-centric New View Synthesis | Omni3D zero-shot (test) | PSNR 17.09 | 4 |
| Object-centric New View Synthesis | CO3D + MVImgNet (test) | PSNR 17.74 | 4 |
| Reference-guided Inpainting | real-world set (test) | PSNR 25.733 | 3 |