
VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

About

Replicating in-context learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy (x_s : x_t :: x_q : y_q). We adapt a frozen Diffusion Transformer (DiT) with role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. To bridge gaps in current visual-context datasets, we also curate a large-scale dataset spanning perception, restoration, and editing. Experiments show that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
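The abstract's core adaptation idea, a Mixture-of-Experts LoRA applied over a frozen backbone, can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's actual code: the class name, expert count, rank, and soft token-wise routing are all assumptions; the paper's routing and where the adapters attach inside the DiT may differ.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """A frozen linear layer augmented with a mixture of LoRA experts.

    A small router produces per-token soft gates over several low-rank
    experts; routing different tasks to different experts is one way to
    reduce gradient interference when a single adapter is trained on
    heterogeneous tasks. Names and shapes here are illustrative.
    """

    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        d_in, d_out = base.in_features, base.out_features
        self.router = nn.Linear(d_in, num_experts)
        # Expert-wise low-rank factors: A zero-init on B keeps the
        # adapted layer identical to the base layer at initialization.
        self.lora_a = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        gates = torch.softmax(self.router(x), dim=-1)            # (B, T, E)
        # Per-expert low-rank update: x @ A_e @ B_e for every expert e
        delta = torch.einsum("btd,edr,ero->bteo",
                             x, self.lora_a, self.lora_b)        # (B, T, E, d_out)
        update = (gates.unsqueeze(-1) * delta).sum(dim=2)        # (B, T, d_out)
        return self.base(x) + self.scale * update

layer = MoELoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

Only the router and the low-rank factors are trainable here; the base projection stays frozen, mirroring the abstract's "frozen DiT" setup.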

Zhiwen Li, Zhongjie Duan, Jinyan Ye, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen • 2026

Related benchmarks

Task                        | Dataset                                      | Metric | Result | Rank
Low-light Image Enhancement | LOL v2                                       | PSNR   | 25.248 | 32
Colorization                | Visual In-Context Learning (V-ICL) Benchmark | FID    | 43.75  | 5
Depth Estimation            | Visual In-Context Learning (V-ICL) Benchmark | AbsRel | 0.138  | 5
Edge Detection              | Visual In-Context Learning (V-ICL) Benchmark | RMSE   | 28.59  | 5
Image Deraining             | Visual In-Context Learning (V-ICL) Benchmark | PSNR   | 29.67  | 5
Interactive Segmentation    | Visual In-Context Learning (V-ICL) Benchmark | IoU    | 79.5   | 5
Lineart Estimation          | Lineart                                      | RMSE   | 34.82  | 5
Low-light Enhancement       | Visual In-Context Learning (V-ICL) Benchmark | PSNR   | 25.24  | 5
Object Detection            | Visual In-Context Learning (V-ICL) Benchmark | IoU    | 56.2   | 5
Object Detection            | PASCAL-5i                                    | mIoU   | 72.1   | 5

Showing 10 of 17 rows
