Learning Visual N-Grams from Web Data
About
Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
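The abstract mentions loss functions inspired by n-gram language models. As a rough illustration only, the sketch below shows the general idea of scoring a dictionary of n-grams against an image embedding and penalizing the negative log-likelihood of the n-grams observed in the comment; the dictionary size, embedding dimension, and helper names are illustrative assumptions, not the paper's actual formulation (which also includes smoothed variants).

```python
import numpy as np

def extract_ngrams(tokens, n_max=3):
    """All n-grams of length 1..n_max from a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def ngram_loss(image_feat, ngram_embeddings, observed_ids):
    """Negative log-likelihood of observed n-grams under a softmax
    over the whole n-gram dictionary (a naive illustrative variant)."""
    scores = ngram_embeddings @ image_feat   # one score per dictionary n-gram
    scores -= scores.max()                   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[observed_ids].sum()

# Toy example with random features (all sizes hypothetical).
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 128))   # hypothetical n-gram dictionary embeddings
phi = rng.normal(size=128)         # hypothetical image embedding
loss = ngram_loss(phi, E, [3, 17, 42])
```

In the paper the image embedding comes from a feed-forward convolutional network, and both the network and the n-gram embeddings are trained jointly; here they are random stand-ins.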
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | -- | -- | 1453 |
| Image Classification | ImageNet-1k (test) | Top-1 Accuracy | 11.5 | 798 |
| Image Classification | ImageNet | -- | -- | 429 |
| Image Retrieval | MS-COCO 5K (test) | R@1 | 5 | 217 |
| Image Retrieval | Flickr30k (test) | R@1 | 8.8 | 195 |
| Text Retrieval | MS-COCO 5K (test) | R@1 | 8.7 | 182 |
| Text Retrieval | Flickr30k (test) | R@1 | 15.4 | 89 |
| Image Classification | SUN | Accuracy | 23 | 27 |
| Image Classification | aYahoo | Accuracy | 72.4 | 2 |