Linking Image and Text with 2-Way Nets

About

Linking two data sources is a basic building block in numerous computer vision problems. Canonical Correlation Analysis (CCA) achieves this by using a linear optimizer to maximize the correlation between the two views. Recent work makes use of non-linear models, including deep learning techniques, that optimize the CCA loss in some feature space. In this paper, we introduce a novel, bi-directional neural network architecture for the task of matching vectors from two data sources. Our approach employs two tied neural network channels that project the two views into a common, maximally correlated space using the Euclidean loss. We show a direct link between the correlation-based loss and the Euclidean loss, enabling the use of the Euclidean loss for correlation maximization. To overcome common Euclidean regression optimization problems, we adapt well-known techniques to our problem, including batch normalization and dropout. We show state-of-the-art results on a number of computer vision matching tasks, including MNIST image matching and sentence-image matching on the Flickr8k, Flickr30k and COCO datasets.
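The link between the Euclidean loss and correlation mentioned in the abstract can be illustrated with a standard identity. This is a minimal sketch, not the paper's derivation or architecture. It assumes each feature dimension is standardized (zero mean, unit variance) over the batch, in which case the mean squared distance between paired vectors equals 2d minus twice the sum of per-dimension correlations. The helper names (`standardize`, `mean_squared_distance`, `correlation_sum`) and the synthetic data are hypothetical.

```python
import numpy as np

def standardize(a):
    """Center each feature column and scale it to unit variance over the batch."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def mean_squared_distance(x, y):
    """Average squared Euclidean distance between paired rows of x and y."""
    return np.mean(np.sum((x - y) ** 2, axis=1))

def correlation_sum(x, y):
    """Sum over feature dimensions of the Pearson correlation of x[:, i] and y[:, i]."""
    xs, ys = standardize(x), standardize(y)
    return np.sum(np.mean(xs * ys, axis=0))

# Synthetic paired "views" (an assumption made purely for illustration).
rng = np.random.default_rng(0)
n, d = 512, 8
x = rng.normal(size=(n, d))
y = 0.7 * x + 0.3 * rng.normal(size=(n, d))

# For standardized features: mean ||x - y||^2 = 2d - 2 * sum_i corr(x_i, y_i)
lhs = mean_squared_distance(standardize(x), standardize(y))
rhs = 2 * d - 2 * correlation_sum(x, y)
print(np.isclose(lhs, rhs))
```

Under this assumption, lowering the Euclidean loss is the same as raising the summed correlation, which gives one intuition for why normalization matters when a Euclidean loss stands in for a correlation objective.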

Aviv Eisenschtat, Lior Wolf • 2016

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Image Retrieval | Flickr30k (test) | Recall@1: 36 | 423
Image-to-Text Retrieval | Flickr30k (test) | R@1: 49.8 | 370
Image Retrieval | Flickr30k (test) | R@1: 36 | 195
Image Retrieval | Flickr30K | R@1: 36 | 144
Image Retrieval | MS-COCO 1K (test) | R@1: 39.7 | 128
Text-to-Image Retrieval | MSCOCO (1K test) | R@1: 39.7 | 104
Image-to-Text Retrieval | MSCOCO (1K test) | R@1: 55.8 | 82
Image Search | Flickr8K | R@1: 29.3 | 74
Caption Retrieval | MS COCO Karpathy 1k (test) | R@1: 55.8 | 62
Image Annotation | Flickr30k (test) | R@1: 49.8 | 39
Showing 10 of 19 rows
