| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| HotpotQA In-Distribution | QwenLong-L1-32B | Accuracy | 85.2 | 72 | 4d ago |
| LoCoMo | QRRanker | Average F1 | 57.32 | 64 | 4d ago |
| LongBench (test) | LingoEDU | HotpotQA | 7,011 | 59 | 4d ago |
| 2WikiMultiHopQA (Out-Of-Distribution) | ReMemR1 | Accuracy | 63.9 | 54 | 4d ago |
| NarrativeQA | MiA-RAG | F1 Score | 53.56 | 38 | 2d ago |
| DetectiveQA-En | MiA | Accuracy | 75.5 | 32 | 4d ago |
| DetectiveQA-Zh | MiA | Accuracy | 0.8417 | 32 | 4d ago |
| LoCoMo | Mnemis | Single-Hop LLJ Score | 97.1 | 24 | 4d ago |
| LongBench V2 | | SingleDoc Accuracy | 51.43 | 22 | 4d ago |
| FRAMES | | Avg@4 Score | 73.54 | 22 | 4d ago |
| HotpotQA | | Mean Score | 65.49 | 21 | 4d ago |
| Qasper | CE-GOCD | F1 | 83.09 | 17 | 2d ago |
| MultiFieldQA | POP | Accuracy | 57.33 | 15 | 4d ago |
| LV-Eval | PANINI | F1 Score | 14.81 | 14 | 4d ago |
| L-Eval QA | DRIFT | NQ | 80.73 | 13 | 4d ago |
| ∞Bench | StateLM-14B-RL | Accuracy | 78.46 | 13 | 4d ago |
| NovelQA | StateLM-14B-RL | Accuracy | 84.85 | 13 | 4d ago |
| NarrativeQA | Qwen2.5-OpAmp-72B | Exact Match | 61.7 | 11 | 4d ago |
| LongBench Pro | GLM-4.1V-9B-Thinking VERA | F1 Score | 34.2 | 10 | 4d ago |
| MuSiQue | GLM-4.1V-9B-Thinking VERA | F1 Score | 30.58 | 10 | 4d ago |
| Qasper | GLM-4.1V-9B-Thinking VERA | Extract F1 | 54.57 | 10 | 4d ago |
| DocMath | GLM-4.1V-9B-Thinking VERA | F1 Score | 29.02 | 10 | 4d ago |
| LongBench-Cite Average | | C Score | 77.6 | 9 | 4d ago |
| GovReport | | C Score | 68.4 | 9 | 4d ago |
| Dureader | | C Score | 81 | 9 | 4d ago |