| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Visual Question Answering | OVEN Query 1.0 (test) | HM 30.9 | 15 |
| Fine-grained Entity Recognition | OVEN Entity 1.0 (test) | HM 29.6 | 15 |
| (Image, Text)-to-Multimodal Retrieval | OVEN | R@5 75.3 | 14 |
| (Image, Text)-to-Text Retrieval | OVEN | Recall@5 57.8 | 14 |
| Open-Vocabulary Entity Recognition | OVEN | EM 0.789 | 8 |
| Multi-modal retrieval (Image-Text to Text/Image-Text) | OVEN QS | Recall@5 8.39 | 7 |
| Visual Entity Recognition | OVEN (test) | Top-1 Acc (Seen) 33.6 | 7 |
| Multimodal Retrieval | OVEN-8 | R@5 75.98 | 6 |
| Multimodal Retrieval | OVEN-6 | R@5 58.17 | 6 |
| Visual Question Answering | OVEN | EM 15.88 | 6 |
| Open-domain Visual Entity Recognition | OVEN Wiki (human evaluation set) | Score (Seen Entities) 76.1 | 6 |
| Image-text-to-multimodal retrieval | OVEN M-BEIR (test) | Recall@5 67.6 | 4 |
| Image-text-to-text retrieval | OVEN M-BEIR (test) | Recall@5 46.9 | 4 |
| Open-Vocabulary Entity Grounding | OVEN (test) | Accuracy 23.1 | 2 |