| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| MSCOCO | | ChairS 69.6 | 26 | 1mo ago | |
| FOIL-COCO (test) | RefCLIP-S | Accuracy 92.6 | 20 | 1mo ago | |
| POPE averaged across MS-COCO, A-OKVQA, and GQA (Adversarial) | VAF | Accuracy 0.807 | 12 | 1mo ago | |
| POPE averaged across MS-COCO, A-OKVQA, and GQA (Popular) | VAF | Accuracy 85.2 | 12 | 1mo ago | |
| POPE averaged across MS-COCO, A-OKVQA, and GQA (Random) | VAF | Accuracy 90.1 | 12 | 1mo ago | |
| FOIL (test) | RefFLEUR | Accuracy 98.4 | 9 | 1mo ago | |
| MSCOCO Average performance across VLMs (test) | Overthinking Score | AUC 87.33 | 8 | 1mo ago | |
| MSCOCO Qwen3-VL 3 (test) | Overthinking Score | AUC 86.89 | 8 | 1mo ago | |
| MSCOCO Gemma 3 (test) | Overthinking Score | AUC 85.59 | 8 | 1mo ago | |
| MSCOCO LLaVA 1.5 (test) | Overthinking Score | AUC 89.73 | 8 | 1mo ago | |
| AMBER out-of-distribution (OOD) | Overthinking Score | AUC 0.8611 | 8 | 1mo ago | |
| MS COCO 2014 (val) | HaloProbe | Accuracy 90 | 5 | 11d ago | |
| POPE (test) | MAI | Final Performance Score 89.4 | 5 | 12d ago | |