| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-term memory evaluation | LoCoMo | Overall F1 92.3 | 119 |
| Long-context Question Answering | LoCoMo | F1 (Multi Hop) 45.1 | 109 |
| Long-context Memory Retrieval | LoCoMo | Single-hop 97.1 | 70 |
| Multi-hop Question Answering | LoCoMo | F1 48.35 | 67 |
| Single-hop Question Answering | LoCoMo | F1 0.6408 | 53 |
| Open-domain Question Answering | LoCoMo | F1 0.4013 | 53 |
| Temporal Reasoning | LoCoMo | F1 Score 65.06 | 45 |
| Long-context Reasoning | LoCoMo | Average F1 44.94 | 45 |
| Question Answering | LoCoMo | Single Hop F1 67.13 | 38 |
| Long-context reasoning and retrieval | LoCoMo (test) | Single-Hop F1 95.12 | 37 |
| Temporal Question Answering | LoCoMo | F1 0.6634 | 36 |
| Open-Domain | LoCoMo | F1 Score 48.38 | 35 |
| Long-form Dialogue | LoCoMo | EM 37.24 | 32 |
| Overall Reasoning (Average) | LoCoMo | F1 Score (LoCoMo) 43.14 | 28 |
| Single-Hop Reasoning | LoCoMo | F1 Score 57.55 | 28 |
| Multi-Hop Reasoning | LoCoMo | F1 Score 35.88 | 28 |
| Long-term Question Answering | LoCoMo | Multi-Hop F1 44.24 | 27 |
| Memory | LoCoMo | Accuracy 30.18 | 25 |
| Memory | LoCoMo | Execution Time (min) 21.7 | 25 |
| Open-Domain Question Answering | LoCoMo Open-Domain (test) | F1 Score 15.12 | 24 |
| Temporal Question Answering | LoCoMo Temporal (test) | F1 Score 44.09 | 24 |
| Single-Hop Question Answering | LoCoMo Single-Hop (test) | F1 37.9 | 24 |
| Multi-Hop Question Answering | LoCoMo Multi-Hop (test) | F1 Score 26.55 | 24 |
| Long-context Question Answering | LoCoMo | Single-Hop LLJ Score 97.1 | 24 |
| Long-Memory Question Answering | LoCoMo | Accuracy (Single-Hop) 97.5 | 22 |