| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-term memory evaluation | LoCoMo | Overall F1 | 58 | 70 |
| Multi-hop Question Answering | LoCoMo | F1 | 48.35 | 67 |
| Long-context Question Answering | LoCoMo | Average F1 | 57.32 | 64 |
| Long-context Memory Retrieval | LoCoMo | Single-hop | 97.1 | 55 |
| Single-hop Question Answering | LoCoMo | F1 | 0.6408 | 53 |
| Open-domain Question Answering | LoCoMo | F1 | 0.4013 | 53 |
| Long-context reasoning and retrieval | LoCoMo (test) | Single-Hop F1 | 95.12 | 37 |
| Temporal Question Answering | LoCoMo | F1 | 0.6634 | 36 |
| Long-form Dialogue | LoCoMo | EM | 37.24 | 32 |
| Memory | LoCoMo | Accuracy | 30.18 | 25 |
| Memory | LoCoMo | Execution Time (min) | 21.7 | 25 |
| Long-context Reasoning | LoCoMo | Average F1 | 44.94 | 25 |
| Open-Domain Question Answering | LoCoMo Open-Domain (test) | F1 Score | 15.12 | 24 |
| Temporal Question Answering | LoCoMo Temporal (test) | F1 Score | 44.09 | 24 |
| Single-Hop Question Answering | LoCoMo Single-Hop (test) | F1 | 37.9 | 24 |
| Multi-Hop Question Answering | LoCoMo Multi-Hop (test) | F1 Score | 26.55 | 24 |
| Long-context Question Answering | LoCoMo | Single-Hop LLJ Score | 97.1 | 24 |
| Question Answering | LoCoMo | Single Hop F1 | 67.13 | 22 |
| Long-horizon Question Answering | LoCoMo | Multi-Hop RGE-L | 0.2568 | 20 |
| Long-horizon Question Answering | LoCoMo Overall All Categories 1.0 | EM Rank | 4.63 | 20 |
| Long-horizon Question Answering | LoCoMo Single-Hop 1.0 | EM | 16.77 | 20 |
| Long-horizon Question Answering | LoCoMo Open-Domain 1.0 | EM | 7.29 | 20 |
| Long-horizon Question Answering | LoCoMo Temporal 1.0 | EM | 1,121 | 20 |
| Long-horizon Question Answering | LoCoMo Multi-Hop 1.0 | EM | 426 | 20 |
| Conversational Question Answering | LoCoMo Overall | Avg Rank (F1) | 1 | 20 |