
A Unified Sequence Interface for Vision Tasks

About

While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely object detection, instance segmentation, keypoint detection, and image captioning, whose outputs range from bounding boxes to dense masks. Nevertheless, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all of these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as the task description, and the sequence output adapts to the prompt to produce task-specific output. We show that such a model achieves performance competitive with well-established task-specific models.
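The core idea of the pixel-to-sequence interface is that structured outputs such as bounding boxes are serialized into discrete tokens: continuous coordinates are quantized into a fixed number of bins, class labels get their own token range, and a short prompt token selects the task. The sketch below illustrates this for detection; the bin count, vocabulary layout, and `<detect>` prompt token are illustrative assumptions, not the paper's exact values.

```python
# Hypothetical sketch of a pixel-to-sequence interface for detection.
# Bin count, token offsets, and the prompt token are assumed values.

NUM_BINS = 1000          # quantization bins for coordinates (assumed)
CLASS_OFFSET = NUM_BINS  # class tokens placed after coordinate tokens (assumed)

def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to a discrete bin index."""
    v = int(value / size * (num_bins - 1) + 0.5)
    return max(0, min(num_bins - 1, v))

def box_to_tokens(box, class_id, img_w, img_h):
    """Serialize one object as [ymin, xmin, ymax, xmax, class] tokens."""
    xmin, ymin, xmax, ymax = box
    return [
        quantize(ymin, img_h),
        quantize(xmin, img_w),
        quantize(ymax, img_h),
        quantize(xmax, img_w),
        CLASS_OFFSET + class_id,
    ]

# One image's detection target: a task prompt followed by per-object tokens.
prompt = ["<detect>"]  # short task-description prompt (placeholder)
tokens = prompt + box_to_tokens((64.0, 32.0, 320.0, 240.0), class_id=17,
                                img_w=640, img_h=480)
print(tokens)  # ['<detect>', 67, 100, 500, 500, 1017]
```

Because every task's output lives in the same token vocabulary, one decoder with one cross-entropy loss can be trained on all tasks, with the prompt alone determining which kind of sequence to emit.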

Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey Hinton • 2022

Related benchmarks

| Task                          | Dataset              | Metric | Result | Rank |
|-------------------------------|----------------------|--------|--------|------|
| Object Detection              | COCO 2017 (val)      | AP     | 46.5   | 2454 |
| Instance Segmentation         | COCO 2017 (val)      | --     | --     | 1144 |
| Object Detection              | COCO (val)           | mAP    | 46.5   | 613  |
| Instance Segmentation         | COCO (val)           | APmk   | 38.2   | 472  |
| Instance Segmentation         | COCO                 | APmask | 38.7   | 279  |
| Unconditional Image Generation| CIFAR-10 unconditional | FID  | 12.75  | 159  |
| Object Detection              | COCO                 | --     | --     | 144  |
| Object Detection              | COCO                 | mAP    | 46.5   | 107  |
| Image Captioning              | MS-COCO              | CIDEr  | 1.18   | 61   |
| Keypoint Detection            | COCO (val)           | AP     | 64.8   | 60   |

Showing 10 of 15 rows.

Other info

Code
