FashionViL: Fashion-Focused Vision-and-Language Representation Learning

About

Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. However, in the fashion domain, existing V+L methods are inadequate because they overlook the unique characteristics of both fashion V+L data and fashion downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L data point contains only a single image-text pair, a fashion data point can contain multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image+text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To exploit this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture consisting of a modality-agnostic Transformer so that it can be adapted to any downstream task. Extensive experiments show that FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
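The Multi-View Contrastive Learning idea described above can be illustrated with a small sketch. The abstract does not state the exact loss, so this assumes a symmetric InfoNCE objective over cosine similarities with a temperature of 0.07; the function names and the toy embeddings are hypothetical and are not taken from the authors' code. Row i of `image_embs` stands for the visual embedding of one image of product i, and row i of `fused_embs` stands for the multimodal embedding of another image of the same product combined with its text.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] within the batch.

    Assumption: InfoNCE is used here for illustration only; the abstract
    does not specify the loss form or the temperature.
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            sum(x * y for x, y in zip(anchor, pos)) / temperature
            for pos in positives
        ]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def multi_view_contrastive_loss(image_embs, fused_embs, temperature=0.07):
    """Symmetric loss between image-only and image+text embeddings (hypothetical)."""
    return 0.5 * (
        info_nce(image_embs, fused_embs, temperature)
        + info_nce(fused_embs, image_embs, temperature)
    )


# Toy batch of two products with 2-dimensional embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
fused_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row matches its own product
fused_swapped = [[0.1, 0.9], [0.9, 0.1]]   # rows paired with the wrong product

aligned_loss = multi_view_contrastive_loss(image_embs, fused_aligned)
swapped_loss = multi_view_contrastive_loss(image_embs, fused_swapped)
print(aligned_loss < swapped_loss)
```

In this toy batch the correctly paired views give a much lower loss than the swapped ones, which is the behaviour a contrastive objective of this kind is meant to reward.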

Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang • 2022

Related benchmarks

Task	Dataset	Result
Composed Image Retrieval	Fashion-IQ (test)	Average Recall@10 0.3121	200
Image-to-Text Retrieval	FashionGen (test)	R@1 65.54	22
Text-to-Image Retrieval	FashionGen (test)	R@1 61.88	22
Image Retrieval	Marqo Fashion200K	MRR 61.7	11
Image Retrieval	NAIL-STAR	MRR 20.8	11
Image-Text Retrieval	NAIL-STAR (test)	MRR 2.3	10
Subcategory Recognition	FashionGen (test)	Accuracy 94.01	8
Image Captioning	FashionGen (test)	BLEU 16.71	7
Cross-modal retrieval	FashionGen full 31 (test)	Recall@1 47.11	6
Image-to-Text Retrieval	FashionGen 31 (test)	Recall@1 42.88	6

Showing 10 of 13 rows
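For readers unfamiliar with the retrieval metrics in the table, the following is a minimal sketch of how Recall@K and MRR are commonly computed. It assumes each query has exactly one correct candidate and that candidates are ranked by descending similarity score; the helper names and toy scores are hypothetical, and the individual benchmarks may use different protocols or report values as percentages.

```python
def rank_of_match(scores, match_index):
    """1-based rank of the correct candidate when sorted by descending score."""
    target = scores[match_index]
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct candidate over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: three queries, three candidates each (scores are made up).
score_rows = [
    [0.9, 0.2, 0.1],
    [0.3, 0.5, 0.8],
    [0.4, 0.1, 0.6],
]
match_indices = [0, 1, 0]  # index of the correct candidate for each query

ranks = [rank_of_match(row, idx) for row, idx in zip(score_rows, match_indices)]
print(ranks)
print(recall_at_k(ranks, 1), recall_at_k(ranks, 2), mean_reciprocal_rank(ranks))
```

Multiplying these fractions by 100 gives the percentage form in which such metrics are often reported.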
