
LongBench

Benchmarks

Task Name                            Dataset Name                   SOTA Result                 Trend
Long-context Language Modeling       LongBench                      Average Score 58.4          328
Long-context Language Understanding  LongBench                      M-Avg 60.31                 294
Long-context language understanding  LongBench (test)               Average Score 51.87         147
Long-context understanding           LongBench                      F1 Score 34                 143
Long-context understanding           LongBench (test)               Avg Score 58.7              136
Long Context Understanding           LongBench V2                   Overall Score 82.36         133
Query Routing                        LongBench OOD v2               QA 53                       120
Long-context understanding           LongBench                      Overall Average Score 62.1  115
Long-context Reasoning               LongBench                      Accuracy (LongBench) 70.4   101
Long-context language understanding  LongBench-e                    Average Score 53.04         93
Long-context Evaluation              LongBench                      Average Score 31.96         90
Long-context Reasoning               LongBench v2                   Average Score 68.2          88
Long-context Language Understanding  LongBench                      Average Score 58.4          86
Long-context understanding           LongBench 1.0 (test)           NarrativeQA 32.94           84
Long-context understanding           LongBench                      HotpotQA 57.15              82
Single-Doc Question Answering        LongBench                      MultifieldQA Score 53.67    75
Long-context understanding           LongBench (test)               FewShot Performance 71.4    72
Long-context Question Answering      LongBench (test)               HotpotQA 7,011              69
Question Answering                   LongBench Qasper               F1 0.4459                   62
Long-context language understanding  LongBench v2                   Overall Accuracy 46.32      62
Long-context Reasoning               LongBench                      Score 73.8                  62
Long-context language understanding  LongBench 1.0 (test)           MultiNews 61.5              61
Long-context Understanding           LongBench                      Accuracy 103                60
Long document retrieval              LongBench Retrieval v2 (full)  F1 Score 0.4843             55
Few-shot Learning                    LongBench                      TREC Score 82.62            51
Showing 25 of 205 rows