Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
About
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches at different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework, and it attains new state-of-the-art results by large margins on four downstream tasks: text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% absolute improvement), category recognition (ACC: 3.28% absolute improvement), and fashion captioning (BLEU-4: 1.2 absolute improvement). We further validate the efficiency of Kaleido-BERT on a wide range of e-commerce websites, demonstrating its broad potential in real-world applications.
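The kaleido strategy above splits each image into grids at five scales (1×1 up to 5×5) and assigns one self-supervised task to each scale. The sketch below is a minimal illustration of that multi-scale patching, not the authors' implementation; the function name, the exact grid sizes per task, and the NumPy array representation are our assumptions for exposition.

```python
import numpy as np

def kaleido_patches(image, max_level=5):
    """Split an H x W x 3 image into k x k grids for k = 1..max_level,
    yielding the multi-scale "kaleido" patches described above.
    Edge pixels left over from integer division are dropped for simplicity."""
    h, w, _ = image.shape
    levels = {}
    for k in range(1, max_level + 1):
        ph, pw = h // k, w // k
        levels[k] = [
            image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            for i in range(k)
            for j in range(k)
        ]
    return levels

# One self-supervised task per scale, as listed in the paper
# (the pairing of task to scale follows the paper's ordering):
SCALE_TASKS = {
    1: "rotation",        # predict the rotation applied to the single patch
    2: "jigsaw",          # recover the permutation of the 2x2 patches
    3: "camouflage",      # spot the patch swapped in from another image
    4: "grey-to-color",   # reconstruct color from a greyscale patch
    5: "blank-to-color",  # reconstruct a blanked-out patch
}

image = np.zeros((224, 224, 3), dtype=np.uint8)
levels = kaleido_patches(image)
print(len(levels[5]), levels[3][0].shape)  # 25 patches at the 5x5 scale
```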
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image-to-Text Retrieval | FashionGen (test) | R@1: 27.99 | 22 |
| Text-to-Image Retrieval | FashionGen (test) | R@1: 33.88 | 22 |
| Category Recognition | FashionGen (test) | Accuracy: 95.07 | 4 |
| Subcategory Recognition | FashionGen (test) | Accuracy: 88.07 | 8 |
| Fashion Captioning | FashionGen (test) | BLEU-4: 5.7 | 7 |