Enhancing Visual In-Context Learning by Multi-Faceted Fusion
About
Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically selects the single best visual prompt, a practice that discards valuable contextual information from other suitable candidates. Recent work has explored fusing the top-K prompts into a single, enhanced representation, but this still collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight the framework's strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
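The branch-construction idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: prompts are represented as plain feature vectors, retrieval uses cosine similarity, the three branches are the pairwise combinations of the top-3 prompts, and fusion is an element-wise mean standing in for whatever learned fusion the framework actually uses. The function names (`top_k`, `mean_fuse`, `build_branches`) are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k):
    """Indices of the k candidates most similar to the query (assumed retrieval rule)."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]


def mean_fuse(vectors):
    """Element-wise mean; a placeholder for a learned fusion module."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_branches(query, candidates, k=3, group_size=2, n_branches=3):
    """Fuse different combinations of the top-k prompts into separate branches.

    The choice of pairwise combinations is an assumption for illustration.
    """
    indices = top_k(query, candidates, k)
    groups = list(combinations(indices, group_size))[:n_branches]
    return [(group, mean_fuse([candidates[i] for i in group])) for group in groups]


# Toy 2-D features: one query and five candidate prompts.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
branches = build_branches(query, candidates)
for group, fused in branches:
    print(group, fused)
```

Each branch keeps a distinct view of the retrieved context instead of a single collapsed representation; in the framework described above, the three guidance signals would then be consumed jointly by the downstream model.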
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Foreground segmentation | Pascal-5i Fold-0 (test) | mIoU | 47.64 | 13 |
| Foreground segmentation | Pascal-5i Fold-1 (test) | mIoU | 54.96 | 13 |
| Foreground segmentation | Pascal-5i Fold-2 (test) | mIoU | 46.07 | 13 |
| Foreground segmentation | Pascal-5i Fold-3 (test) | mIoU | 48.38 | 13 |
| Foreground segmentation | Pascal-5i Mean of folds (test) | mIoU | 49.26 | 13 |
| Single-object detection | PASCAL VOC 2012 (test) | mIoU | 45.19 | 13 |
| Semantic segmentation | PASCAL-5^2 5i (cross-dataset) | mIoU | 42.39 | 13 |
| Image colorization | ImageNet 1k (test) | MSE | 0.53 | 10 |
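
The "Mean of folds" row can be checked against the four per-fold results. The short sketch below assumes the mean is a plain arithmetic average of the fold mIoU values, and it reads the Fold-3 result as 48.38 on the same 0-100 scale as the other folds; the variable names are illustrative only.

```python
# Per-fold foreground-segmentation mIoU on Pascal-5i, on a 0-100 scale.
fold_miou = {
    "Fold-0": 47.64,
    "Fold-1": 54.96,
    "Fold-2": 46.07,
    "Fold-3": 48.38,  # assumed scale; see the note above
}

# Unweighted arithmetic mean across folds (assumed aggregation rule).
mean_miou = sum(fold_miou.values()) / len(fold_miou)
print(f"{mean_miou:.2f}")  # prints 49.26
```

The result agrees with the reported mean of 49.26, which supports reading all four fold values on a common scale.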