
LM Eval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot downstream task evaluation | LM-EVAL (average of HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande), zero-shot, latest | Average Accuracy: 76 | 30 |
| Question Answering and Commonsense Reasoning | LM Eval (ARC-Challenge, ARC-Easy, HellaSwag, PIQA), v0.4.4 standard (test) | ARC-Challenge: 61.6 | 18 |
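The "Average Accuracy" figure in the first row can be read as an unweighted (macro) mean of zero-shot accuracy over the five listed tasks. A minimal sketch of that aggregation, assuming equal task weighting; the per-task scores below are hypothetical placeholders for illustration, not reported results:

```python
# Tasks averaged by the LM-EVAL zero-shot benchmark row above.
TASKS = ["hellaswag", "piqa", "arc_easy", "arc_challenge", "winogrande"]

def average_accuracy(scores: dict) -> float:
    """Unweighted (macro) mean accuracy over the benchmark's task set."""
    return sum(scores[t] for t in TASKS) / len(TASKS)

# Hypothetical per-task accuracies, for illustration only.
example_scores = {
    "hellaswag": 80.0,
    "piqa": 82.0,
    "arc_easy": 78.0,
    "arc_challenge": 60.0,
    "winogrande": 80.0,
}
print(round(average_accuracy(example_scores), 1))  # → 76.0
```

Note that a macro average treats every task equally regardless of dataset size; per-task scores must be inspected separately, since a strong average can hide a weak ARC-Challenge result.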