| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| LoCoMo | R2-Mem | F1 Score 41.62 | 68 | 20d ago | |
| StrategyQA | OpenMath2-Llama3.1-70B* | Accuracy 95.6 | 50 | 5d ago | |
| MuSiQue | | Accuracy 51 | 48 | 1d ago | |
| MuSiQue | GPT-4o-0806 | EM 53 | 41 | 3mo ago | |
| 2WikiMQA IRCoT 500 samples (test) | ActiShade | ACC 52.8 | 27 | 3mo ago | |
| HotpotQA IRCoT (500 samples) (test) | ActiShade | ACC 54.6 | 27 | 3mo ago | |
| MuSiQue IRCoT 500 samples (test) | ActiShade | ACC 25.59 | 27 | 3mo ago | |
| 2WikiMHQA | CoT-UQ | AUROC 0.7002 | 26 | 3mo ago | |
| HotpotQA | CoT-UQ | AUROC 67.19 | 26 | 3mo ago | |
| HotpotQA | Qwen3-14B | Accuracy 72.2 | 23 | 1d ago | |
| HotpotQA | BGE-reranker | F1 Score 55 | 22 | 23h ago | |
| TAT-QA | OCC-RAG-1.7B | F1 Score 81 | 21 | 1d ago | |
| TriviaQA | Prompt-R1 | Exact Match (EM) 70.31 | 17 | 2mo ago | |
| RULER QA | Qwen2.5-14B-1M-LongRLVR | Accuracy (32K Context) 95.4 | 17 | 3mo ago | |
| Housing QA | TOTAL | Accuracy 82.67 | 15 | 1mo ago | |
| FanOutQA | TOTAL | F1 Score 71.84 | 15 | 1mo ago | |
| CRAG | TOTAL | F1 Score 30.08 | 15 | 1mo ago | |
| MuSiQue | TOTAL | F1 Score 73.3 | 15 | 1mo ago | |
| CommaQA-E compositional | ChatGPT (SKiC) | Exact Match 80.8 | 15 | 3mo ago | |
| CommaQA-E (test) | ChatGPT (SKiC) | Exact Match 70 | 15 | 3mo ago | |
| MuSiQue | CoT | Relative Cost 1 | 14 | 1mo ago | |
| HotpotQA | CoT | Relative Cost 1 | 14 | 1mo ago | |
| MuSiQue | HeadRank | Recall@2 42.67 | 13 | 1mo ago | |
| HotpotQA | HeadRank | Recall@2 72.65 | 13 | 1mo ago | |
| 2WikiMultihopQA | GeoFaith | Accuracy 82.1 | 12 | 7d ago | |