Learning Visual N-Grams from Web Data
About
Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
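The abstract mentions loss functions inspired by n-gram language models. As a rough illustration only, the sketch below shows the general idea of scoring a dictionary of n-grams against an image embedding and penalizing the negative log-likelihood of the n-grams observed in the comment; the dictionary size, embedding dimension, and helper names are illustrative assumptions, not the paper's actual formulation (which also includes smoothed variants).

```python
import numpy as np

def extract_ngrams(tokens, n_max=3):
    """All n-grams of length 1..n_max from a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def ngram_loss(image_feat, ngram_embeddings, observed_ids):
    """Negative log-likelihood of observed n-grams under a softmax
    over the whole n-gram dictionary (a naive illustrative variant)."""
    scores = ngram_embeddings @ image_feat   # one score per dictionary n-gram
    scores -= scores.max()                   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[observed_ids].sum()

# Toy example with random features (all sizes hypothetical).
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 128))   # hypothetical n-gram dictionary embeddings
phi = rng.normal(size=128)         # hypothetical image embedding
loss = ngram_loss(phi, E, [3, 17, 42])
```

In the paper the image embedding comes from a feed-forward convolutional network, and both the network and the n-gram embeddings are trained jointly; here they are random stand-ins.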
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | -- | -- | 1453 |
| Image Classification | ImageNet-1k (test) | Top-1 Accuracy | 11.5 | 798 |
| Image Classification | ImageNet | -- | -- | 429 |
| Image Retrieval | MS-COCO 5K (test) | R@1 | 5 | 217 |
| Image Retrieval | Flickr30k (test) | R@1 | 8.8 | 195 |
| Text Retrieval | MS-COCO 5K (test) | R@1 | 8.7 | 182 |
| Text Retrieval | Flickr30k (test) | R@1 | 15.4 | 89 |
| Image Classification | SUN | Accuracy | 23 | 27 |
| Image Classification | aYahoo | Accuracy | 72.4 | 2 |