
Tag2Text: Guiding Vision-Language Model via Image Tagging

About

This paper presents Tag2Text, a vision-language pre-training (VLP) framework that introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works that utilize object tags either manually labeled or automatically detected by an off-the-shelf detector with limited performance, our approach explicitly learns an image tagger using tags parsed from image-paired text, and thus provides strong semantic guidance to vision-language models. In this way, Tag2Text can utilize large-scale annotation-free image tags in accordance with image-text pairs, and provides more diverse tag categories beyond objects. As a result, Tag2Text demonstrates the capabilities of a foundational image tagging model, with superior zero-shot performance comparable even to fully supervised models. Moreover, by leveraging the tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results at similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance. Code, demo, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
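The core idea above is that supervision for the image tagger comes for free: tags are parsed out of the text already paired with each image. A minimal sketch of that parsing step is below; the vocabulary, captions, and helper name are illustrative assumptions, not Tag2Text's actual implementation, which uses a full text parser and a curated tag vocabulary.

```python
def parse_tags(caption: str, vocabulary: set[str]) -> set[str]:
    """Extract candidate tags from an image caption by matching words
    against a fixed tag vocabulary (a toy stand-in for the real parser)."""
    # Normalize: lowercase and strip simple punctuation before splitting.
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    return {w for w in words if w in vocabulary}


# Hypothetical tag vocabulary; note it can include actions and scenes,
# not only objects, matching the "beyond objects" claim in the abstract.
vocab = {"dog", "frisbee", "park", "running", "grass", "sunset"}

tags = parse_tags("A dog running after a frisbee in the park.", vocab)
print(sorted(tags))  # ['dog', 'frisbee', 'park', 'running']
```

These parsed tags then serve as annotation-free multi-label targets for training the tagger alongside the usual image-text objectives.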

Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Lei Zhang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Open vocabulary image classification | Wiki-H | Used Objects: 397 | 10 |
| Image Classification | World-H (test) | Used Objects: 176 | 10 |
| Medical image captioning | RAD | Final Score: 0.3332 | 4 |
| Medical image captioning | Slake | Final Score: 0.2956 | 4 |
| Human Alignment Correlation | Video-Bench | -- | 3 |
