LongBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context Language Understanding | LongBench | M-Avg: 60.31 | 219 |
| Long-context language understanding | LongBench (test) | Average Score: 51.87 | 133 |
| Long-context understanding | LongBench | Overall Average Score: 62.1 | 115 |
| Long-context understanding | LongBench (test) | Avg Score: 54 | 80 |
| Long-context Reasoning | LongBench | Score: 73.8 | 62 |
| Long-context Understanding | LongBench | Accuracy: 103 | 60 |
| Long-context Question Answering | LongBench (test) | HotpotQA: 7,011 | 59 |
| Long document retrieval | LongBench Retrieval v2 (full) | F1 Score: 0.4843 | 55 |
| Long-context Reasoning | LongBench v2 | Average Score: 68.2 | 48 |
| Long-context Language Modeling | LongBench | Single-Document QA: 42.77 | 44 |
| Long-context language modeling | LongBench-E 1.0 (test) | S-Doc QA Perf.: 49.92 | 37 |
| Long Context Understanding | LongBench V2 | Overall Score: 65.6 | 37 |
| Single-Doc Question Answering | LongBench | MultifieldQA Score: 49.34 | 36 |
| Long-context understanding | LongBench 1.0 (test) | NarrativeQA: 26.63 | 32 |
| Long-context understanding | LongBench V1 | NQA: 31 | 30 |
| Long-context understanding | LongBench (test) | SingleDoc Performance: 45.2 | 30 |
| Long-context understanding | LongBench | 2WikiMQA: 55.13 | 25 |
| Long-context Question Answering | LongBench V2 | SingleDoc Accuracy: 51.43 | 22 |
| Long-context understanding | LongBench v1 (test) | SD QA: 49.6 | 21 |
| Long-context language understanding | LongBench 1.0 (test) | MultiNews: 61.5 | 21 |
| Long-context language understanding | LongBench v2 | Overall Accuracy: 46.32 | 20 |
| Long-context understanding | LongBench | MFQA: 30.94 | 18 |
| Long-context understanding | LongBench | Overall Average Score: 31.8 | 17 |
| Long-context Language Understanding | LongBench-e (test) | LCC (Language Comprehension Score): 68.42 | 16 |
| Long-context generation | LongBench | Average Score: 48.5 | 16 |
Showing 25 of 81 rows