Linking Image and Text with 2-Way Nets
About
Linking two data sources is a basic building block in numerous computer vision problems. Canonical Correlation Analysis (CCA) achieves this with a linear optimizer that maximizes the correlation between the two views. Recent work makes use of non-linear models, including deep learning techniques, that optimize the CCA loss in some feature space. In this paper, we introduce a novel, bi-directional neural network architecture for the task of matching vectors from two data sources. Our approach employs two tied neural network channels that project the two views into a common, maximally correlated space using the Euclidean loss. We show a direct link between the correlation-based loss and the Euclidean loss, enabling the use of the Euclidean loss for correlation maximization. To overcome common Euclidean regression optimization problems, we adapt well-known techniques to our problem, including batch normalization and dropout. We show state-of-the-art results on a number of computer vision matching tasks, including MNIST image matching and sentence-image matching on the Flickr8k, Flickr30k and COCO datasets.
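The link between Euclidean loss and correlation mentioned in the abstract can be illustrated with a standard identity: for two features that are each standardized to zero mean and unit variance, the mean squared distance equals 2(1 - ρ), where ρ is their Pearson correlation, so minimizing one maximizes the other. The snippet below is a minimal one-dimensional NumPy sketch of that identity, not the paper's derivation or training code. The helper names (`standardize`, `mean_squared_distance`, `pearson`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def standardize(v):
    """Shift to zero mean and scale to unit (population) variance."""
    return (v - v.mean()) / v.std()


def mean_squared_distance(x, y):
    """Average squared Euclidean distance between paired samples."""
    return float(np.mean((x - y) ** 2))


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic, partially correlated "views" (illustrative data only).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.7 * x + 0.3 * rng.normal(size=1000)

xs, ys = standardize(x), standardize(y)
rho = pearson(xs, ys)
msd = mean_squared_distance(xs, ys)

# For standardized features: mean((xs - ys)^2) = 1 + 1 - 2*rho.
print(f"mean squared distance = {msd:.6f}")
print(f"2 * (1 - rho)         = {2 * (1 - rho):.6f}")
```

The two printed values agree up to floating-point error. The identity relies on both features having unit variance, which is one reason normalizing activations matters when a Euclidean loss stands in for a correlation objective.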
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | R@1 | 36 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 49.8 | 370 |
| Image Retrieval | Flickr30k (test) | R@1 | 36 | 195 |
| Image Retrieval | Flickr30K | R@1 | 36 | 144 |
| Image Retrieval | MS-COCO 1K (test) | R@1 | 39.7 | 128 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 39.7 | 104 |
| Image-to-Text Retrieval | MSCOCO (1K test) | R@1 | 55.8 | 82 |
| Image Search | Flickr8K | R@1 | 29.3 | 74 |
| Caption Retrieval | MS COCO Karpathy 1k (test) | R@1 | 55.8 | 62 |
| Image Annotation | Flickr30k (test) | R@1 | 49.8 | 39 |