| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context language model evaluation | HELMET | Average Score: 55.2 | 39 |
| Long-context language modeling evaluation | HELMET | Average Sparsity: 0 | 28 |
| Long-context language modeling | HELMET | Summarization Score: 247 | 27 |
| Long-context understanding | HELMET 2025 | Accuracy (8K Context): 61.44 | 16 |
| Long-context understanding | HELMET | Accuracy: 68.5 | 15 |
| Long-context language modeling evaluation | HELMET (held-out eval) | Accuracy (8K Context): 57.61 | 13 |
| Question answering | HELMET (RAG subset) | HotpotQA Accuracy: 81.1 | 8 |
| Long-context multimodal understanding | HELMET | Accuracy: 67.6 | 6 |
| Long-context understanding | HELMET (context lengths ≤32K) | Score (8K Context): 59.34 | 4 |
| Holistic long-context understanding | HELMET (Holistic Understanding) | HELMET Holistic Understanding (64K Context): 46.51 | 4 |