Visual Prompting via Image Inpainting
About
How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting, that is, filling in a hole in a concatenated visual prompt image, turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated: 88k unlabeled figures from academic paper sources on arXiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, and edge detection.
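To make the "hole in a concatenated visual prompt image" idea concrete, here is a minimal sketch of how such a prompt could be assembled. It assumes a 2x2 grid with the example input and output on the top row, the query on the bottom left, and the hole on the bottom right. The layout, the padding, and the names `build_visual_prompt` and `read_prediction` are illustrative assumptions, not the paper's code, and the inpainting model itself is not shown.

```python
import numpy as np


def build_visual_prompt(example_in, example_out, query_in, pad=2):
    """Place three images in a 2x2 grid and leave the bottom-right cell empty.

    Returns the canvas and a boolean mask that is True over the hole
    an inpainting model would be asked to fill. (Hypothetical helper.)
    """
    h, w, c = example_in.shape
    for img in (example_out, query_in):
        if img.shape != (h, w, c):
            raise ValueError("all images must share the same shape")

    canvas = np.zeros((2 * h + pad, 2 * w + pad, c), dtype=example_in.dtype)
    mask = np.zeros(canvas.shape[:2], dtype=bool)

    canvas[:h, :w] = example_in          # top-left: task example input
    canvas[:h, w + pad:] = example_out   # top-right: task example output
    canvas[h + pad:, :w] = query_in      # bottom-left: new query image
    mask[h + pad:, w + pad:] = True      # bottom-right: hole to inpaint
    return canvas, mask


def read_prediction(filled_canvas, mask):
    """Crop the inpainted region back out of the completed canvas."""
    ys, xs = np.where(mask)
    return filled_canvas[ys.min():ys.max() + 1, xs.min():xs.max() + 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b, q = (rng.random((4, 4, 3)) for _ in range(3))
    canvas, mask = build_visual_prompt(a, b, q)
    # Stand-in for a real inpainting model: fill the hole with a constant.
    filled = canvas.copy()
    filled[mask] = 0.5
    print(read_prediction(filled, mask).shape)
```

In this framing the task is specified entirely by the example pair; switching from segmentation to colorization would change only the images placed in the top row, not the model.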
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | PASCAL-5^i Fold-0 | mIoU | 28.66 | 75 |
| Semantic segmentation | PASCAL-5^i Fold-1 | mIoU | 30.21 | 75 |
| Semantic segmentation | PASCAL-5^i Fold-2 | mIoU | 27.81 | 75 |
| Semantic segmentation | PASCAL-5^i Fold-3 | mIoU | 23.55 | 75 |
| 3D Pose Estimation | Human3.6M | MPJPE (mm) | 351 | 66 |
| Few-shot Segmentation | FSS-1000 (test) | mIoU | 58.3 | 50 |
| Few-shot Segmentation | PASCAL-5i | -- | -- | 46 |
| Single Object Detection | PASCAL VOC 2012 | mIoU | 25.36 | 27 |
| Foreground segmentation | Pascal-5i (1) | mIoU | 30.44 | 16 |
| Future Pose Estimation | H3.6M | MPJPE (200ms) | 316.8 | 15 |