Enhancing Visual In-Context Learning by Multi-Faceted Fusion

About

Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically selects the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. Recent work has explored fusing the top-K prompts into a single, enhanced representation, but this still collapses multiple rich signals into one and limits the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from a different combination of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, demonstrate the method's strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
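The branch construction described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how prompts are retrieved, which combinations form each branch, or how they are fused. The sketch assumes cosine-similarity retrieval, pairwise combinations of the top-3 prompts (which happen to yield three branches), and an element-wise mean as the fusion operator. All function names (`top_k_prompts`, `mean_fuse`, `build_branches`) and the toy feature vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_prompts(query, pool, k=3):
    """Indices of the k pool features most similar to the query (assumed retrieval rule)."""
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    return ranked[:k]


def mean_fuse(features):
    """Assumed fusion operator: element-wise mean of the given feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def build_branches(query, pool, k=3, group_size=2, n_branches=3):
    """Form one fused representation per combination of top-k prompts.

    With k=3 and group_size=2 there are exactly three combinations,
    giving three branches; this pairing scheme is an assumption.
    """
    top = top_k_prompts(query, pool, k)
    groups = list(combinations(top, group_size))[:n_branches]
    return [(group, mean_fuse([pool[i] for i in group])) for group in groups]


# Toy prompt features (hypothetical 2-D embeddings) and a query feature.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]

branches = build_branches(query, pool)
for group, fused in branches:
    print(group, fused)
```

In this sketch each branch sees a different pair of prompts, so no single fused vector has to carry all of the retrieved context. A downstream model would then consume the three branch representations jointly, which is the role the abstract assigns to MULTI-VQGAN.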

Wenwen Liao, Jianbo Yu, Yuansong Wang, Qingchao Jiang, Xiaofeng Yang • 2026

Related benchmarks

Task	Dataset	Result
Foreground segmentation	Pascal-5i Fold-0 (test)	mIoU 47.64	25
Foreground segmentation	Pascal-5i Fold-1 (test)	mIoU 54.96	25
Single Object Detection	PASCAL VOC 2012 (test)	mIoU 45.19	24
Image Colorization	ImageNet 1k (test)	MSE 0.53	17
Foreground segmentation	Pascal-5i Fold-2 (test)	mIoU 46.07	13
Foreground segmentation	Pascal-5i Fold-3 (test)	mIoU 48.38	13
Foreground segmentation	Pascal-5i Mean of folds (test)	mIoU 49.26	13
Semantic segmentation	PASCAL-5^2 5i (cross-dataset)	mIoU 42.39	13
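
As a quick consistency check on the foreground segmentation rows, the mean-of-folds score should be the arithmetic mean of the four per-fold mIoU values. The short sketch below assumes every fold is expressed on the same 0-100 percentage scale (Fold-3 taken as 48.38); the variable names are illustrative only.

```python
# Per-fold mIoU on Pascal-5i (test), all on the 0-100 percentage scale.
fold_miou = {
    "Fold-0": 47.64,
    "Fold-1": 54.96,
    "Fold-2": 46.07,
    "Fold-3": 48.38,
}

# Arithmetic mean over the four folds.
mean_miou = sum(fold_miou.values()) / len(fold_miou)
print(f"{mean_miou:.2f}")  # prints 49.26
```

The result matches the reported mean-of-folds value of 49.26.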
