
FashionViL: Fashion-Focused Vision-and-Language Representation Learning

About

Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. However, in the fashion domain, existing V+L methods are inadequate because they overlook the unique characteristics of both fashion V+L data and the downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L data point contains only a single image-text pair, a fashion data point can contain multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image+text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To exploit this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture consisting of a modality-agnostic Transformer so that it can be adapted to any downstream task. Extensive experiments show that FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
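
The Multi-View Contrastive Learning idea in the abstract can be illustrated with a small, hedged sketch. It assumes a symmetric InfoNCE-style loss, which is a common choice for contrastive objectives; the abstract does not state the exact loss, so this is not necessarily the authors' formulation. The function names, the temperature of 0.07, and the toy embeddings are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where queries[i] should match keys[i].

    All other keys in the batch act as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def multi_view_contrastive_loss(view_embeddings, fused_embeddings,
                                temperature=0.07):
    """Symmetric InfoNCE between two sets of embeddings (an assumption).

    view_embeddings[i]:  visual embedding of one image of product i.
    fused_embeddings[i]: multimodal embedding of another image of
                         product i combined with its text.
    """
    return 0.5 * (info_nce(view_embeddings, fused_embeddings, temperature)
                  + info_nce(fused_embeddings, view_embeddings, temperature))


# Toy 2-D embeddings for two products (hypothetical values).
aligned_views = [[1.0, 0.0], [0.0, 1.0]]
aligned_fused = [[0.9, 0.1], [0.1, 0.9]]
shuffled_fused = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = multi_view_contrastive_loss(aligned_views, aligned_fused)
loss_shuffled = multi_view_contrastive_loss(aligned_views, shuffled_fused)
```

In this toy setup, the loss is lower when each image embedding is paired with the multimodal embedding of the same product than when the pairs are shuffled, which is the behavior a contrastive objective of this kind is meant to reward.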

Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | Fashion-IQ (test) | Dress Recall@10 | 0.3347 | 145 |
| Image-to-Text Retrieval | FashionGen (test) | R@1 | 65.54 | 22 |
| Text-to-Image Retrieval | FashionGen (test) | R@1 | 61.88 | 22 |
| Subcategory Recognition | FashionGen (test) | Accuracy | 94.01 | 8 |
| Image Captioning | FashionGen (test) | BLEU | 16.71 | 7 |
| Cross-modal retrieval | FashionGen full 31 (test) | Recall@1 | 47.11 | 6 |
| Image-to-Text Retrieval | FashionGen 31 (test) | Recall@1 | 42.88 | 6 |
| Text-to-Image Retrieval | FashionGen 31 (test) | R@1 | 51.34 | 6 |
| Category Recognition | FashionGen | Accuracy | 97.48 | 4 |
| Subcategory Recognition | FashionGen | Accuracy | 92.23 | 4 |
