LoCoMo

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context Question Answering | LoCoMo | F1 (Multi Hop) 50.25 | 171 |
| Long-term memory evaluation | LoCoMo | Overall F1 93 | 128 |
| Multi-hop Question Answering | LoCoMo | F1 48.35 | 125 |
| Single-hop Question Answering | LoCoMo | F1 0.6408 | 111 |
| Open-domain Question Answering | LoCoMo | F1 0.4013 | 111 |
| Temporal Question Answering | LoCoMo | F1 0.6634 | 85 |
| Long-context Memory Retrieval | LoCoMo | Single-hop 97.1 | 80 |
| Long-context Reasoning | LoCoMo | Average F1 95.06 | 75 |
| Multi-Hop Reasoning | LoCoMo | F1 Score 41.62 | 68 |
| Long-term conversational memory | LoCoMo | Overall Acc (LoCoMo) 79.44 | 59 |
| Long-context Conversational Question Answering | LoCoMo | Multi-Hop F1 43.1 | 59 |
| Long-context Management | LoCoMo | F1 Score 65.2 | 57 |
| Overall Reasoning (Average) | LoCoMo | F1 Score (LoCoMo) 44.68 | 52 |
| Open-Domain | LoCoMo | F1 Score 48.38 | 51 |
| Single-Hop | LoCoMo | F1 Score 59.03 | 47 |
| Temporal | LoCoMo | F1 Score 0.4935 | 47 |
| Question Answering | LoCoMo | Single Hop F1 67.13 | 45 |
| Long-context Question Answering | LoCoMo | Single-Hop LLJ Score 97.1 | 45 |
| Temporal Reasoning | LoCoMo | F1 Score 65.06 | 45 |
| Long-context reasoning and retrieval | LoCoMo (test) | Single-Hop F1 95.12 | 37 |
| Conversation Question Answering | LOCOMO (test) | RAG F1 36.75 | 36 |
| Long-form Dialogue | LoCoMo | EM 37.24 | 32 |
| Single-Hop Reasoning | LoCoMo | F1 Score 57.55 | 28 |
| Long-term Question Answering | LoCoMo | Multi-Hop F1 44.24 | 27 |
| Long-term dialogue memory | LoCoMo (test) | Accuracy 84.23 | 27 |
Showing 25 of 178 rows
...