Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
About
Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text data. Here, we also tackle a more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.
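The contrast between a single averaged embedding and several embeddings per instance can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the 2-D vectors, and the rule of scoring two instances by their best-matching pair of embeddings are all assumptions chosen for illustration of the multiple-instance idea.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_pair_similarity(embs_a, embs_b):
    """Score two instances by their best-matching pair of embeddings.

    Each argument is a list of K vectors (hypothetical layout). Taking the
    max over all K x K pairs is an assumed multiple-instance-style rule:
    one good pairing is enough to treat the instances as a match.
    """
    return max(cosine(u, v) for u in embs_a for v in embs_b)


def mean_embedding(embs):
    """Collapse K embeddings into one point, mimicking a single-point embedding."""
    return [sum(column) / len(embs) for column in zip(*embs)]


# Toy data: an ambiguous image with two "meanings" and two captions,
# each of which describes only one of those meanings.
image = [[1.0, 0.0], [0.0, 1.0]]
text_a = [[1.0, 0.1], [0.9, 0.0]]
text_b = [[0.1, 1.0], [0.0, 0.8]]

multi_a = best_pair_similarity(image, text_a)
multi_b = best_pair_similarity(image, text_b)
single_a = cosine(mean_embedding(image), mean_embedding(text_a))
```

In this toy setup both captions match the image closely under the best-pair score, while the averaged single-point embedding scores noticeably lower against either caption, which is the "average representation" problem the abstract describes.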
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 43.4 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 59.1 | 370 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1 | 45.2 | 299 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 | 32.4 | 286 |
| Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 32.4 | 223 |
| Image-to-Text Retrieval | MS-COCO 1K (test) | R@1 | 69.2 | 121 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 55.2 | 104 |
| Image-to-Text Retrieval | MSCOCO (1K test) | R@1 | 69.2 | 82 |
| Image Retrieval | Flickr30K 1K (test) | -- | -- | 70 |
| Text to Image | MS-COCO 1K (test) | R@1 | 55.2 | 53 |
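
The R@1 figures above are Recall@K scores with K = 1. A minimal sketch of how such a score can be computed from a query-by-candidate similarity matrix follows. It assumes, as a simplification, that query `i` has exactly one correct candidate at index `i`; the function name and the toy scores are hypothetical, and real benchmark protocols (for example, several captions per image) may differ.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores: queries 0 and 2 rank their own candidate first,
# query 1 ranks it second.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]

r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Here two of three queries retrieve their match at rank 1, and all three do so within the top 2.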