LongMemEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Long-context Memory Evaluation | LongMemEval | Single-Turn Preference: 100 | 28 |
| Long-term Memory Evaluation | LongMemEval S (test) | KU (Knowledge Update): 94.4 | 27 |
| Memory | LongMemEval | Accuracy: 34.72 | 25 |
| End-to-End Performance | LongMemEval | Top-5 Recall: 59.9 | 20 |
| Runtime Agent Memory | LongMemEval | F1 Score: 40.53 | 20 |
| Retrieval | LongMemEval | Recall@5: 85.8 | 18 |
| Dialogue Memory Accuracy | LongMemEval-S (N=500) | Temporal Accuracy: 91 | 17 |
| Long-context memory evaluation | LongMemEval-s | Overall Score: 75 | 12 |
| Long-term dialogue memory | LongMemEval (test) | Accuracy: 85.75 | 11 |
| Long-term Memory Retrieval | LongMemEval-S | SSU: 100 | 9 |
| Context Selection | LongMemEval session-level (test) | F1 Score: 0.649 | 8 |
| Single-session-user | LongMemEval S (test) | F1 Score: 0.1948 | 7 |
| Multi-session | LongMemEval S (test) | F1 Score: 6.61 | 7 |
| Temporal-reasoning | LongMemEval S (test) | F1 Score: 15.03 | 7 |
| Single-session-preference | LongMemEval S (test) | F1 Score: 14.14 | 7 |
| Single-session-user | LongMemEval-M | F1 Score: 8.67 | 7 |
| Knowledge-update | LongMemEval-M | F1 Score: 6.23 | 7 |
| Multi-session | LongMemEval-M | F1 Score: 4.81 | 7 |
| Temporal-reasoning | LongMemEval-M | F1 Score: 12.69 | 7 |
| Single-session-assistant | LongMemEval-M | F1: 10.88 | 7 |
| Single-session-preference | LongMemEval-M | F1 Score: 13.79 | 7 |
| Knowledge-update | LongMemEval S | F1 Score: 11.83 | 7 |
| Single-session-assistant | LongMemEval-S | F1 Score: 17.92 | 7 |
| Memory Recall Efficiency | LongMemEval-S | Memory Footprint (tokens): 1,091.51 | 6 |
| Question Answering | LongMemEval | Accuracy: 65 | 6 |
Showing 25 of 30 rows