BABILong

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Long-context Reasoning | BABILong 16k | Accuracy: 29.8 | 72 |
| Long-context Reasoning | BABILong 8k | Accuracy: 34.7 | 65 |
| Long-context Reasoning | BABILong 4k | Accuracy (BABILong 4k): 38.5 | 51 |
| Question Answering | Babilong 16k context length | QA1 Accuracy: 58 | 9 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 8K | Accuracy: 57 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 4K | Accuracy: 61.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 2K | Accuracy: 65.2 | 6 |
| Long-context reasoning | BABILong | Err (2k Context): 14.1 | 6 |
| Question Answering | Babilong 128k context length | QA1 Score: 38 | 5 |
| Question Answering | Babilong 64k context length | QA1 Score: 25 | 5 |
| Needle-in-a-Haystack Retrieval | BABILong 32K context length | Accuracy: 9 | 3 |
| Needle-in-a-Haystack Retrieval | BABILong 16K context length | Needle-in-a-Haystack Accuracy (16K): 22.2 | 3 |