Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
About
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. We also find that existing pre-trained models have different strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
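To make the task setup concrete, here is a minimal sketch of linking a query to one entity in a shared label space by nearest-neighbor search over embeddings. It assumes the image and text query have already been fused into a single vector and that each entity has a vector of the same dimension; the function names, entity identifiers, and toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_entity(query_vec, entity_vecs):
    """Return the entity ID whose embedding is most similar to the query.

    query_vec:   hypothetical fused embedding of (image, text query).
    entity_vecs: dict mapping a hypothetical entity ID to its embedding.
    """
    return max(entity_vecs, key=lambda name: cosine(query_vec, entity_vecs[name]))


# Toy label space with three made-up entities (the real one has millions).
toy_entities = {
    "entity_a": [1.0, 0.0],
    "entity_b": [0.0, 1.0],
    "entity_c": [0.7, 0.7],
}
predicted = link_entity([0.9, 0.1], toy_entities)
print(predicted)
```

A linear scan like this is only workable for a toy label space; with millions of entities, an approximate nearest-neighbor index would likely be needed instead.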
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Fine-grained Entity Recognition | OVEN Entity 1.0 (test) | HM | 11.5 | 15 |
| Visual Question Answering | OVEN Query 1.0 (test) | HM | 3.5 | 15 |
| Visual Entity Recognition | OVEN (test) | Top-1 Acc (Seen) | 33.6 | 7 |
| Open-domain Visual Entity Recognition | OVEN Wiki (human evaluation set) | Score (Seen Entities) | 18 | 6 |
| Open-domain Visual Entity Recognition | OVEN-Wiki (test) | Entity Split SEEN | 12.6 | 5 |
| Open-Vocabulary Entity Grounding | OVEN (test) | Accuracy | 20 | 2 |
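
The table does not define "HM". Assuming it denotes the harmonic mean of accuracy on seen and unseen entities, which is a common way to summarize such splits, the metric could be computed as in the sketch below. The function name and the input values are illustrative only and are not taken from the table.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of two accuracies (assumed meaning of "HM").

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Illustrative values, not results from the benchmark.
hm = harmonic_mean(40.0, 10.0)
print(hm)
```

The harmonic mean is pulled toward the lower of the two scores, so a model cannot reach a high HM by doing well on seen entities alone.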