
Classification Done Right for Vision-Language Pre-Training

About

We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which relies on a text encoder to produce contrastive targets, SuperClass directly uses tokenized raw text as supervised classification labels, without any additional text filtering or selection. Because there is no text encoding to contrast against, SuperClass requires neither a text encoder nor the large batch sizes that CLIP depends on. SuperClass demonstrates superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language tasks. We further explore the scaling behavior of SuperClass with respect to model size, training length, and data size, and report encouraging results and comparisons to CLIP. Code: https://github.com/x-cls/superclass
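The core idea above can be sketched in a few lines: tokenize the caption, turn the set of subword ids into a (normalized) multi-hot label over the vocabulary, and train the vision encoder's classifier head with cross-entropy against that label. The sketch below is a toy illustration under assumed details (toy vocabulary size, uniform normalization of the multi-hot target, softmax cross-entropy); the exact loss and tokenizer follow the paper and the linked repository, not this snippet.

```python
import math
import random

VOCAB_SIZE = 32  # toy size; a real run uses a subword vocabulary (tens of thousands of entries)

def caption_to_target(token_ids, vocab_size=VOCAB_SIZE):
    """Tokenized raw text -> normalized multi-hot classification target.

    Each unique subword id in the caption becomes a positive class;
    no text encoder or text filtering is involved.
    """
    unique = set(token_ids)
    return [1.0 / len(unique) if i in unique else 0.0 for i in range(vocab_size)]

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classification_loss(image_logits, token_ids):
    """Cross-entropy between the image classifier's softmax output and the
    caption-derived multi-hot target. Per-sample, so no large batch is needed."""
    target = caption_to_target(token_ids)
    probs = softmax(image_logits)
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, probs) if t > 0)

# Toy example: an image whose caption tokenizes to subword ids [3, 7, 7, 19].
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]  # stand-in for vision-encoder output
loss = classification_loss(logits, [3, 7, 7, 19])
```

In contrast to CLIP's InfoNCE objective, this loss is computed per image against its own caption-derived label, which is why no in-batch negatives (and hence no large batch) are required.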

Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan • 2024

Related benchmarks

Task                            | Dataset                 | Metric         | Result | Rank
Visual Question Answering       | VizWiz                  | Accuracy       | 54.33  | 1525
Object Hallucination Evaluation | POPE                    | Accuracy       | 85.69  | 1455
Visual Question Answering       | VQA v2                  | Accuracy       | 75.24  | 1362
Visual Question Answering       | GQA                     | Accuracy       | 60.96  | 1249
Multimodal Evaluation           | MME                     | --             | --     | 658
Image Classification            | ImageNet 1k (test)      | Top-1 Accuracy | 87.8   | 450
Multimodal Understanding        | MMMU                    | Accuracy       | 36     | 437
Visual Question Answering       | ScienceQA               | Accuracy       | 66.09  | 370
Image Classification            | ImageNet-1k 1.0 (test)  | Top-1 Accuracy | 85     | 229
Image Captioning                | COCO Captions (test)    | CIDEr          | 113    | 15

(Showing 10 of 12 rows.)
