
From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

About

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region that should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or bounding boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that state-of-the-art visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
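The core idea of encoding interactions directly into the example input-output pairs can be illustrated with a minimal sketch. The function and parameter names below (`encode_interaction`, `build_prompt`) are hypothetical and not from the paper; the sketch assumes interactions are rendered as colored bounding-box outlines drawn into the pixel content of each example, so the in-context learner sees the cue as part of the image itself:

```python
import numpy as np

def encode_interaction(image, box, color=(255, 0, 0)):
    """Overlay a bounding-box interaction cue onto an image.

    Draws the box outline in `color` so the user's interaction
    becomes part of the pixel content the in-context learner sees.
    `box` is (y0, x0, y1, x1) in pixel coordinates.
    """
    out = image.copy()
    y0, x0, y1, x1 = box
    out[y0, x0:x1 + 1] = color  # top edge
    out[y1, x0:x1 + 1] = color  # bottom edge
    out[y0:y1 + 1, x0] = color  # left edge
    out[y0:y1 + 1, x1] = color  # right edge
    return out

def build_prompt(examples, query_image, query_box):
    """Assemble an in-context prompt sequence: interaction-annotated
    example inputs interleaved with their outputs, followed by the
    interaction-annotated query image.

    `examples` is a list of (input_image, box, target_image) triples.
    """
    sequence = []
    for inp, box, target in examples:
        sequence.append(encode_interaction(inp, box))
        sequence.append(target)
    sequence.append(encode_interaction(query_image, query_box))
    return sequence
```

Because the cue lives in the image pixels rather than in a separate input channel, the model architecture and the in-context prompting protocol stay unchanged, which is what allows unseen interaction types at test time without fine-tuning.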

Carlos Schmidt, Simon Reiß • 2026

Related benchmarks

Task | Dataset | Result | Rank
Interactive Object Removal | RORD | LPIPS: 19 | 45
Interactive Semantic Segmentation | PASCAL VOC 2012 | Accuracy (Bbox): 52.91 | 10
Interactive Super-resolution | ADE20K Bounding Box Interaction | LPIPS: 41.74 | 9
Interactive Super-resolution | ADE20K Ellipse Interaction | LPIPS: 39.81 | 9
