| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| Short-context benchmarks ARC-C, ARC-E, PIQA, Winogrande, HellaSwag | | ARC-C Accuracy 63.48 | 45 | 1mo ago | |
| NLP Benchmark Suite Zero-shot (HellaSwag, RACE, PIQA, WinoGrande, ARC, OBQA) (test) | LLAMA-30B | HellaSwag Accuracy 63.36 | 28 | 20d ago | |
| Zero-Shot Evaluation Suite (Arc-e, Arc-c, Boolq, Hellaswag, Openbookqa, Piqa, SciQ, Winogrande) | StableQAT | ARC-E 65.74 | 18 | 3mo ago | |
| LM Eval ARCC, ARCE, HellaSwag, PIQA 0.4.4 standard (test) | | ARCC 61.6 | 18 | 3mo ago | |
| lm-eval-harness PIQA, COPA, OpenBookQA, Winogrande, SciQA, ARC-E, ARC-C | | PIQA Accuracy 78.8 | 10 | 3mo ago | |
| Downstream Tasks (ARC-C, HellaSwag, PIQA, WinoGrande) zero-shot | | ARC-C Accuracy 55.8 | 3 | 23d ago | |