| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| ConFiQA (test) | ProbeRAG | F1 Score | 95.7 | 36 | 1mo ago |
| ActivityNet | | Accuracy | 62.3 | 29 | 3mo ago |
| Vad-Reasoning-Plus | Qwen3VL-Thinking | BLEU-3 | 0.106 | 27 | 3mo ago |
| MSVD | MiniGPT4-Video | Accuracy | 73.92 | 22 | 3mo ago |
| TruthfulQA | MoLaCE | Neutral Accuracy | 74.24 | 15 | 3mo ago |
| SAGE Web Search | | Weighted Recall (Com. Sci.) | 35.1 | 12 | 3mo ago |
| MMAD (test) | MAU-GPT | ROUGE-1 | 0.7026 | 12 | 3mo ago |
| HybridQA (test) | ToT | Accuracy | 91 | 11 | 2mo ago |
| MoreHopQA (test) | RouteGoT | Accuracy | 77 | 11 | 2mo ago |
| HotpotQA (test) | RouteGoT | Accuracy | 88 | 11 | 2mo ago |
| TREC-DL-NF (S5) | MinosEval | Kendall's Tau (K) | 68.61 | 11 | 3mo ago |
| ANTIQUE (S5) | MinosEval | Kendall's Tau (K) | 65.97 | 11 | 3mo ago |
| PHYSOLYM-A v1 (held-out) | | Problem-level Score | 33.4 | 9 | 19d ago |
| OlymBench Phys v1 (test) | | Problem Level Score | 53.9 | 9 | 19d ago |
| PUB-OE v3 (test) | Physics-R1 (dense) | Subpart AND (v3) | 37.7 | 9 | 19d ago |
| PhysReason v2 (test) | GPT-4o | Subpart-AND (v2) | 51.1 | 9 | 19d ago |
| Proposed LLM-based evaluation benchmark OEQ | | Completeness | 96.9 | 9 | 3mo ago |
| QAEGO4D (test) | GroundVQAB | ROUGE | 30.4 | 9 | 3mo ago |
| LingoQA | QwenVL 3.5 | ROUGE-L | 32 | 8 | 12d ago |
| CrossAlpaca-Eval en 2.0 | Qwen2.5-7B-Instruction | GPT-4o Score | 8.58 | 8 | 1mo ago |
| Qasper | w/t BoT | Accuracy | 13.91 | 7 | 12d ago |
| NarrativeQA | w/t BoT | Accuracy | 73.77 | 7 | 12d ago |
| Earth Observation | Qwen3 | Judge Score | 97.05 | 7 | 1mo ago |
| OKVQA | PStar | LVM Evaluation Score | 71.6 | 6 | 14d ago |
| PBSBench Slide-level 1.0 (test) | PBS-VL | BLEU-1 | 36 | 6 | 1mo ago |