OOLONG

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context reasoning | OOLONG | Accuracy 68.4 | 37 |
| Long-context reasoning | OOLONG trec_coarse | Score 86.6 | 28 |
| Long-context reasoning | OOLONG | Latency (s) 7.1 | 27 |
| Long-Context Reasoning | Oolong-Synth | Accuracy 78.41 | 11 |
| Long-document Question Answering | Oolong | Accuracy 68 | 10 |
| Reasoning | Oolong real 2025 (test) | Score 15.1 | 9 |
| Reasoning | Oolong real | Score 0.151 | 9 |
| Long-context Question Answering | Oolong Real | Score 37.46 | 9 |
| Long-context Question Answering | Oolong Synthetic | Score 71.75 | 8 |
| Long-context Classification | OOLONG (test) | TREC-Q-coarse Accuracy 58.1 | 6 |
| Long-context reasoning | OOLONG-REAL 650 samples (175K bucket) | Average Reward 0.249 | 2 |