Multiple Object Recognition with Visual Attention

About

We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.

Jimmy Ba, Volodymyr Mnih, Koray Kavukcuoglu• 2014

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10	Accuracy64.81	973
Image Classification	ImageNet100 (test)	Top-1 Acc21.44	87
Classification	COCO	Accuracy43.32	31
Scanpath Alignment	CIFAR-10	DTW837.1	18
Image Classification	ImageNet	Top-1 Accuracy67.5	10
Visual Search	COCO-Search18 cross-task	Accuracy (%)14.34	7

Showing 6 of 6 rows

Other info

Follow for update

@wizwand_team Discord