A Generative Approach for Wikipedia-Scale Visual Entity Recognition

About

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is to use dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which, given an input image, learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
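
To make the dual-encoder baseline mentioned in the abstract concrete, here is a minimal sketch of retrieval in a shared embedding space. It is an illustration only, not the authors' code: the three-dimensional embeddings and entity names are invented toy values, no real image or text encoder (such as CLIP) is called, and an exact cosine-similarity search stands in for the approximate k-NN search one would need at the scale of 6 million entities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_entities(query_emb, entity_embs, k=1):
    """Return the k entity names whose embeddings are closest to the query.

    Exact search over every entity; a real system would swap this for an
    approximate nearest-neighbor index.
    """
    ranked = sorted(
        entity_embs.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical, hand-made embeddings standing in for encoded entity names.
ENTITY_EMBS = {
    "Golden Retriever": [0.9, 0.1, 0.0],
    "Eiffel Tower": [0.0, 0.2, 0.9],
    "Boeing 747": [0.1, 0.9, 0.2],
}

# Hypothetical embedding standing in for an encoded query image.
query = [0.8, 0.2, 0.1]
best = top_k_entities(query, ENTITY_EMBS, k=2)
print(best)
```

GER, by contrast, does not search an index of entity embeddings; as the abstract describes it, the model decodes a code identifying the entity directly from the image.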

Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid • 2024

Related benchmarks

Task	Dataset	Result
Fine-grained Entity Recognition	OVEN Entity 1.0 (test)	HM 22.7	15
Visual Entity Recognition	OVEN	HM (Unseen) 17.7	15
Visual Question Answering	OVEN Query 1.0 (test)	HM 6.3	15
Visual Entity Recognition	OVEN entity (test)	Top-1 Accuracy (Seen) 29.1	11

