
HELMET

Benchmarks

Task Name                                  Dataset Name          SOTA Result                   Trend
Long-context language modeling evaluation  HELMET                Average Sparsity: 0           28
Long-context language modeling             HELMET                Summarization Score: 247      27
Long-context Understanding                 HELMET 2025           Accuracy (8K Context): 61.44  16
Long-context language modeling evaluation  HELMET held-out eval  Accuracy (8K Context): 57.61  13
Question Answering                         HELMET RAG subset     HotpotQA Accuracy: 81.1       8
Long-context Multimodal Understanding      Helmet                Accuracy: 67.6                6
Long Context Understanding                 HELMET                Accuracy: 64.7                4