
LoCoMo

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Long-term memory evaluation | LoCoMo | Overall F1: 92.3 | 119 |
| Long-context Question Answering | LoCoMo | F1 (Multi Hop): 45.1 | 109 |
| Long-context Memory Retrieval | LoCoMo | Single-hop: 97.1 | 70 |
| Multi-hop Question Answering | LoCoMo | F1: 48.35 | 67 |
| Single-hop Question Answering | LoCoMo | F1: 0.6408 | 53 |
| Open-domain Question Answering | LoCoMo | F1: 0.4013 | 53 |
| Temporal Reasoning | LoCoMo | F1 Score: 65.06 | 45 |
| Long-context Reasoning | LoCoMo | Average F1: 44.94 | 45 |
| Question Answering | LoCoMo | Single Hop F1: 67.13 | 38 |
| Long-context reasoning and retrieval | LoCoMo (test) | Single-Hop F1: 95.12 | 37 |
| Temporal Question Answering | LoCoMo | F1: 0.6634 | 36 |
| Open-Domain | LoCoMo | F1 Score: 48.38 | 35 |
| Long-form Dialogue | LoCoMo | EM: 37.24 | 32 |
| Overall Reasoning (Average) | LoCoMo | F1 Score (LoCoMo): 43.14 | 28 |
| Single-Hop Reasoning | LoCoMo | F1 Score: 57.55 | 28 |
| Multi-Hop Reasoning | LoCoMo | F1 Score: 35.88 | 28 |
| Long-term Question Answering | LoCoMo | Multi-Hop F1: 44.24 | 27 |
| Memory | LoCoMo | Accuracy: 30.18 | 25 |
| Memory | LoCoMo | Execution Time (min): 21.7 | 25 |
| Open-Domain Question Answering | LoCoMo Open-Domain (test) | F1 Score: 15.12 | 24 |
| Temporal Question Answering | LoCoMo Temporal (test) | F1 Score: 44.09 | 24 |
| Single-Hop Question Answering | LoCoMo Single-Hop (test) | F1: 37.9 | 24 |
| Multi-Hop Question Answering | LoCoMo Multi-Hop (test) | F1 Score: 26.55 | 24 |
| Long-context Question Answering | LoCoMo | Single-Hop LLJ Score: 97.1 | 24 |
| Long-Memory Question Answering | LoCoMo | Accuracy (Single-Hop): 97.5 | 22 |
Showing 25 of 114 rows