| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context Memory Evaluation | LongMemEval | Single-Turn Preference | 100 | 28 |
| Long-term Memory Evaluation | LongMemEval S (test) | KU (Knowledge Update) | 94.4 | 27 |
| Memory | LongMemEval | Accuracy | 34.72 | 25 |
| End-to-End Performance | LongMemEval | Top-5 Recall | 59.9 | 20 |
| Runtime Agent Memory | LongMemEval | F1 Score | 40.53 | 20 |
| Retrieval | LongMemEval | Recall@5 | 85.8 | 18 |
| Dialogue Memory Accuracy | LongMemEval-S (N=500) | Temporal Accuracy | 91 | 17 |
| Long-context memory evaluation | LongMemEval-s | Overall Score | 75 | 12 |
| Long-term dialogue memory | LongMemEval (test) | Accuracy | 85.75 | 11 |
| Long-term Memory Retrieval | LongMemEval-S | SSU | 100 | 9 |
| Context Selection | LongMemEval session-level (test) | F1 Score | 0.649 | 8 |
| Single-session-user | LongMemEval S (test) | F1 Score | 0.1948 | 7 |
| Multi-session | LongMemEval S (test) | F1 Score | 6.61 | 7 |
| Temporal-reasoning | LongMemEval S (test) | F1 Score | 15.03 | 7 |
| Single-session-preference | LongMemEval S (test) | F1 Score | 14.14 | 7 |
| Single-session-user | LongMemEval-M | F1 Score | 8.67 | 7 |
| Knowledge-update | LongMemEval-M | F1 Score | 6.23 | 7 |
| Multi-session | LongMemEval-M | F1 Score | 4.81 | 7 |
| Temporal-reasoning | LongMemEval-M | F1 Score | 12.69 | 7 |
| Single-session-assistant | LongMemEval-M | F1 | 10.88 | 7 |
| Single-session-preference | LongMemEval-M | F1 Score | 13.79 | 7 |
| Knowledge-update | LongMemEval S | F1 Score | 11.83 | 7 |
| Single-session-assistant | LongMemEval-S | F1 Score | 17.92 | 7 |
| Memory Recall Efficiency | LongMemEval-S | Memory Footprint (tokens) | 1,091.51 | 6 |
| Question Answering | LongMemEval | Accuracy | 65 | 6 |