
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

About

We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches of different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework. It attains new state-of-the-art results by large margins on four downstream tasks, including text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% abs. imv.), category recognition (ACC: 3.28% abs. imv.), and fashion captioning (BLEU-4: 1.2 abs. imv.). We validate the efficiency of Kaleido-BERT on a wide range of e-commercial websites, demonstrating its broader potential in real-world applications.
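The "kaleido" strategy above pre-trains on image patches at multiple scales (a pyramid of 1x1 up to 5x5 grids, one per self-supervised task). The sketch below illustrates this multi-scale patch extraction; the function name, parameters, and scale-to-task mapping are illustrative assumptions, not the authors' released API.

```python
import numpy as np

def kaleido_patches(image, scales=(1, 2, 3, 4, 5)):
    """Split an image into a k x k grid of patches for each scale k.

    Illustrative sketch of the kaleido multi-scale strategy:
    scale 1 -> rotation, 2 -> jigsaw, 3 -> camouflage,
    4 -> grey-to-color, 5 -> blank-to-color (per the abstract).
    """
    h, w = image.shape[:2]
    pyramid = {}
    for k in scales:
        patches = []
        for i in range(k):
            for j in range(k):
                # Integer grid boundaries; last row/col absorbs the remainder.
                patch = image[i * h // k:(i + 1) * h // k,
                              j * w // k:(j + 1) * w // k]
                patches.append(patch)
        pyramid[k] = patches
    return pyramid

# A 224x224 RGB image yields 1 + 4 + 9 + 16 + 25 = 55 patches in total.
img = np.zeros((224, 224, 3), dtype=np.uint8)
pyramid = kaleido_patches(img)
total = sum(len(p) for p in pyramid.values())  # 55
```

Each scale's patches then feed one of the five pre-training tasks, so coarser patches carry global layout cues while finer patches carry texture and color cues.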

Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, Ling Shao • 2021

Related benchmarks

Task                                  | Dataset           | Metric            | Result | Rank
--------------------------------------|-------------------|-------------------|--------|-----
Image-to-Text Retrieval               | FashionGen (test) | R@1               | 27.99  | 22
Text-to-Image Retrieval               | FashionGen (test) | R@1               | 33.88  | 22
Image-Text Retrieval                  | Fashion-Gen       | Rank@1            | 27.99  | 10
Text-Image Retrieval                  | Fashion-Gen       | Rank@1            | 33.88  | 10
Subcategory Recognition               | FashionGen (test) | Accuracy          | 88.07  | 8
Image Captioning                      | FashionGen (test) | BLEU              | 5.7    | 7
Category and SubCategory Recognition  | Fashion-Gen       | Category Accuracy | 95.07  | 4
Category Recognition                  | FashionGen        | Accuracy          | 95.07  | 4
Subcategory Recognition               | FashionGen        | Accuracy          | 88.07  | 4
Fashion Captioning                    | Fashion-Gen       | BLEU-4            | 5.7    | 3
