IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

About

In-context learning allows a model to be adapted to new tasks given a task description at test time. In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve the task for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. At inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over 10% AP for Foreground Segmentation and over 5% AP for Single Object Detection, and lowers LPIPS by almost 20% in Colorization. Our empirical results suggest that vision and language prompts are complementary and that it is advantageous to use both to achieve better in-context learning performance. The project page is available at https://jerryxu.net/IMProv.
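The prompting step described above can be pictured as assembling a single canvas that the model completes by inpainting. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the 2x2 grid layout, the toy nested-list image type, and the helper names `build_prompt_canvas` and `build_text_prompt` are all hypothetical. Only the text prompt wording follows the example given in the abstract.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def build_prompt_canvas(
    example_in: Image, example_out: Image, query_in: Image, fill: int = 0
) -> Tuple[Image, List[List[bool]]]:
    """Assemble an assumed 2x2 visual prompt and the mask of the cell to inpaint.

    Layout (an assumption made for illustration):
        example input | example output
        query input   | blank cell to be inpainted
    """
    h, w = len(query_in), len(query_in[0])
    for img in (example_in, example_out, query_in):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all images must share the same height and width")

    # Top half: the solved example, input on the left and output on the right.
    top = [a + b for a, b in zip(example_in, example_out)]
    # Bottom half: the new query on the left, a blank placeholder on the right.
    bottom = [row + [fill] * w for row in query_in]
    canvas = top + bottom

    # True marks pixels a model would be asked to fill in (bottom-right cell).
    mask = [[r >= h and c >= w for c in range(2 * w)] for r in range(2 * h)]
    return canvas, mask


def build_text_prompt(task_output: str) -> str:
    """Format a task description in the style quoted in the abstract."""
    return f"Left: input image, Right: {task_output}"


# Usage with tiny 2x2 toy images.
canvas, mask = build_prompt_canvas(
    [[1, 1], [1, 1]], [[2, 2], [2, 2]], [[3, 3], [3, 3]]
)
text = build_text_prompt("foreground segmentation")
```

In this sketch the canvas, the mask, and the text string are the inputs an inpainting model would receive; the model itself is out of scope here.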

Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang • 2023

Related benchmarks

Task                      | Dataset                                      | Result                     | Rank
Image Deraining           | Visual In-Context Learning (V-ICL) Benchmark | PSNR 15.29                 | 5
Lineart estimation        | Lineart                                      | RMSE 80.25                 | 5
Surface Normal Estimation | Visual In-Context Learning (V-ICL) Benchmark | Median Angular Error 56.08 | 5
Colorization              | Visual In-Context Learning (V-ICL) Benchmark | FID 210.9                  | 5
Interactive Segmentation  | Visual In-Context Learning (V-ICL) Benchmark | IoU 17.8                   | 5
Low-light enhancement     | Visual In-Context Learning (V-ICL) Benchmark | PSNR 15.14                 | 5
Object Detection          | Visual In-Context Learning (V-ICL) Benchmark | IoU 33.7                   | 5
Depth Estimation          | Visual In-Context Learning (V-ICL) Benchmark | AbsRel 0.175               | 5
Edge Detection            | Visual In-Context Learning (V-ICL) Benchmark | RMSE 99.36                 | 5
Object Detection          | PASCAL-5i                                    | mIoU 25.1                  | 5