
HELMET

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context language model evaluation | HELMET | Average Score: 55.2 | 39 |
| Long-context language modeling evaluation | HELMET | Average Sparsity: 0 | 28 |
| Long-context language modeling | HELMET | Summarization Score: 247 | 27 |
| Long-context Understanding | HELMET 2025 | Accuracy (8K Context): 61.44 | 16 |
| Long Context Understanding | HELMET | Accuracy: 68.5 | 15 |
| Long-context language modeling evaluation | HELMET held-out eval | Accuracy (8K Context): 57.61 | 13 |
| Question Answering | HELMET RAG subset | HotpotQA Accuracy: 81.1 | 8 |
| Long-context Multimodal Understanding | Helmet | Accuracy: 67.6 | 6 |
| Long-context Understanding | HELMET shorter context lengths ≤32K | Score (8K Context): 59.34 | 4 |
| Holistic long-context understanding | HELMET Holistic understanding | HELMET Holistic Understanding (64K Context): 46.51 | 4 |