Object Recognition as Next Token Prediction
About
We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder with two key features: tokens from different labels are modeled as independent, and image tokens are treated as a prefix. This masking mechanism enables an efficient inference method - one-shot sampling - which samples tokens of multiple labels in parallel and ranks the generated labels by their probabilities. To further improve efficiency, we propose a simple strategy for constructing a compact decoder: discard the intermediate blocks of a pretrained language model. The resulting decoder matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
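The two masking features described above can be sketched as a boolean attention matrix. The snippet below is a minimal illustration, not the repository's implementation: it assumes a flat token sequence of `num_image_tokens` image embeddings followed by the tokens of each label, builds bidirectional attention over the image prefix, causal attention within each label, and no attention across labels. The function name and layout are hypothetical.

```python
import numpy as np

def build_noncausal_mask(num_image_tokens: int, label_lengths: list[int]) -> np.ndarray:
    """Boolean attention mask (True = query row may attend to key column).

    Illustrative sketch of the two masking ideas:
      - image tokens form a prefix with full bidirectional attention;
      - each label's tokens attend causally within that label and to the
        image prefix, but never to tokens of other labels (independence).
    """
    total = num_image_tokens + sum(label_lengths)
    mask = np.zeros((total, total), dtype=bool)

    # Image prefix: bidirectional attention among all image tokens.
    mask[:num_image_tokens, :num_image_tokens] = True

    start = num_image_tokens
    for length in label_lengths:
        rows = slice(start, start + length)
        # Every label token can see the whole image prefix.
        mask[rows, :num_image_tokens] = True
        # Causal (lower-triangular) attention within this label only.
        mask[rows, rows] = np.tril(np.ones((length, length), dtype=bool))
        start += length
    return mask
```

Because each label block only depends on the shared image prefix and its own previous tokens, all label blocks can be decoded in parallel, which is what makes one-shot sampling possible.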
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-Label Image Recognition | MS-COCO 2014 (val) | mAP | 57.38 | 51 |
| Object Recognition | COCO (val) | Recall | 76.5 | 31 |
| Object Recognition | CC3M (test) | Recall | 0.738 | 21 |
| Object Recognition | OpenImages v7 (val) | Recall | 66.3 | 21 |
| Object Detection | Objects365 | AP | 23.81 | 15 |
| Image Tagging | Objects365 | OP | 34.71 | 11 |
| Object Recognition (Cross-Validation) | COCO (val) | Recall | 0.823 | 10 |
| Object Recognition | CC3M | Recall | 86.8 | 3 |
| Object Recognition | COCO | Recall | 93 | 3 |
| Object Recognition | OpenImages | Recall | 0.874 | 3 |