
Tag2Text: Guiding Vision-Language Model via Image Tagging

About

This paper presents Tag2Text, a vision-language pre-training (VLP) framework that introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works, which utilize object tags either manually labeled or automatically detected with an off-the-shelf detector of limited performance, our approach explicitly learns an image tagger using tags parsed from image-paired text and thus provides strong semantic guidance to vision-language models. In this way, Tag2Text can utilize large-scale, annotation-free image tags derived from image-text pairs, and provides more diverse tag categories beyond objects. As a result, Tag2Text demonstrates the capability of a foundational image tagging model, with superior zero-shot performance comparable even to fully supervised models. Moreover, by leveraging the tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results at similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance. Code, demo, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
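The key idea above is that tag supervision comes from the paired captions themselves rather than from a detector or human labels. A minimal sketch of that parsing step, assuming a fixed illustrative tag vocabulary and simple word matching (the paper's actual parser and vocabulary are more sophisticated; all names below are hypothetical):

```python
# Hedged sketch: deriving annotation-free image tags from a paired caption.
# TAG_VOCABULARY and the matching rule are illustrative assumptions, not
# the parser Tag2Text actually uses.

TAG_VOCABULARY = {"dog", "frisbee", "grass", "person", "beach", "running"}

def parse_tags(caption: str) -> set:
    """Return vocabulary tags that appear in the caption text."""
    # Strip basic punctuation and lowercase before matching.
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    return {w for w in words if w in TAG_VOCABULARY}

tags = parse_tags("A dog running on the grass catches a frisbee.")
# tags -> {"dog", "running", "grass", "frisbee"}
```

Tags obtained this way can cover categories beyond objects (e.g. actions like "running"), which is the diversity the abstract refers to; at training time such parsed tags would supervise the tagging head alongside the image-text objectives.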

Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Lei Zhang • 2023

Related benchmarks

Task                                 | Dataset          | Result             | Rank
Tag Generation                       | TOXICTAGS (test) | chrF 21.02         | 12
Open-Vocabulary Image Classification | Wiki-H           | Used Objects 397   | 10
Image Classification                 | World-H (test)   | Used Objects 176   | 10
Medical Image Captioning             | RAD              | Final Score 0.3332 | 4
Medical Image Captioning             | Slake            | Final Score 0.2956 | 4
Human Alignment Correlation          | Video-Bench      | --                 | 3
