HELM

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | HELM Passage Level v1.0 (test) | AUC 0.9599 | 84 |
| Hallucination Detection | HELM Sentence Level v1.0 (test) | AUC 0.8835 | 84 |
| Language Modeling | HELM macro-averaged (test) | Accuracy 73.2 | 30 |
| Predictive LLM Routing | HELM Lite | OpenAI Performance 64.3 | 26 |
| Natural Language Reasoning | HELM | Synth. Reason. (AS) 54 | 16 |