VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers
About
Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose **VIRAL**, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
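To make the Mixture-of-Experts LoRA idea concrete, here is a minimal numpy sketch of one adapted linear layer: a frozen base weight is shared, each expert contributes a low-rank update `B_i @ A_i`, and a softmax gate mixes the expert updates per input. All names and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_experts = 16, 16, 4, 3  # toy sizes (assumed)

W = rng.normal(size=(d_out, d_in))                    # frozen base weight (e.g. a DiT projection)
A = rng.normal(size=(n_experts, rank, d_in)) * 0.01   # per-expert LoRA "down" projections
B = np.zeros((n_experts, d_out, rank))                # per-expert LoRA "up" projections (zero-init)
G = rng.normal(size=(n_experts, d_in)) * 0.01         # gating weights

def moe_lora_forward(x):
    """x: (batch, d_in) -> (batch, d_out) through frozen W plus gated low-rank experts."""
    logits = x @ G.T                                   # (batch, n_experts)
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)              # softmax gate per sample
    base = x @ W.T                                     # frozen path, never updated
    # expert low-rank updates: contract over d_in then rank -> (batch, n_experts, d_out)
    delta = np.einsum('bi,eri,eor->beo', x, A, B)
    return base + np.einsum('be,beo->bo', gates, delta)

x = rng.normal(size=(2, d_in))
y = moe_lora_forward(x)
print(y.shape)  # (2, 16)
```

With `B` zero-initialized, the adapter starts as an identity perturbation (output equals the frozen path), the standard LoRA trick so fine-tuning begins from the pre-trained model's behavior; routing gradients through separate experts is what mitigates interference across heterogeneous tasks.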
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Low-light Image Enhancement | LOL v2 | PSNR | 25.248 | 32 |
| Colorization | Visual In-Context Learning (V-ICL) Benchmark | FID | 43.75 | 5 |
| Depth Estimation | Visual In-Context Learning (V-ICL) Benchmark | AbsRel | 0.138 | 5 |
| Edge Detection | Visual In-Context Learning (V-ICL) Benchmark | RMSE | 28.59 | 5 |
| Image Deraining | Visual In-Context Learning (V-ICL) Benchmark | PSNR | 29.67 | 5 |
| Interactive Segmentation | Visual In-Context Learning (V-ICL) Benchmark | IoU | 79.5 | 5 |
| Lineart Estimation | Lineart | RMSE | 34.82 | 5 |
| Low-light Enhancement | Visual In-Context Learning (V-ICL) Benchmark | PSNR | 25.24 | 5 |
| Object Detection | Visual In-Context Learning (V-ICL) Benchmark | IoU | 56.2 | 5 |
| Object Detection | PASCAL-5i | mIoU | 72.1 | 5 |