# Class-agnostic Object Detection with Multi-modal Transformer

## About
What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in the literature, we demonstrate that Multi-modal Vision Transformers (MViTs) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs in localizing generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications, including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY
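Since the MViT produces proposals conditioned on a language query, class-agnostic detections can be obtained by prompting it with generic queries and pooling the results. The sketch below illustrates one plausible way to merge the per-query outputs: pool every (box, score) pair and apply greedy non-maximum suppression so near-duplicate boxes surfaced by different queries collapse into a single proposal. The box layout `[x1, y1, x2, y2, score]`, the helper names, and the IoU threshold are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: merging class-agnostic proposals returned by an MViT for
# several generic text queries (e.g. "all objects", "all entities").
# Box format [x1, y1, x2, y2, score] and all helper names are assumptions.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_query_proposals(per_query_boxes, iou_thresh=0.5):
    """Pool proposals from every text query, then apply greedy NMS so
    overlapping boxes from different queries yield one proposal."""
    pooled = sorted((b for boxes in per_query_boxes for b in boxes),
                    key=lambda b: b[4], reverse=True)
    keep = []
    for cand in pooled:
        if all(iou(cand[:4], k[:4]) < iou_thresh for k in keep):
            keep.append(cand)
    return keep
```

For example, if the queries "all objects" and "all entities" both localize the same region, only the higher-scoring box survives, while boxes unique to one query are retained.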
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Salient Object Detection | ECSSD | MAE 0.24 | 202 |
| Object Detection | LVIS (val) | -- | 141 |
| Salient Object Detection | DUT-OMRON | MAE 0.21 | 120 |
| Salient Object Detection | ECSSD (test) | -- | 104 |
| Salient Object Detection | DUT-OMRON (test) | -- | 92 |
| Camouflaged Object Detection | COD10K | -- | 83 |
| Object Detection | DOTA | -- | 28 |
| Object Detection | PACO LVIS | AR@50 27.9 | 14 |
| Object Detection | COCO (val) | AR50 69.7 | 14 |
| Class-agnostic Object Detection | Pascal VOC | AP50 6.86e+3 | 9 |