| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MMLU, ARC-Challenge, and CommonsenseQA Aggregate | RAISE | Average Score | 64.77 | 24 | 3mo ago |
| CodaSet OOD Average (test) | Qwen3-235B | Performance (%) | 87.84 | 16 | 8d ago |
| BIG-Bench | TALE | Accuracy | 85.6 | 12 | 22d ago |
| General Benchmarks Llama 3.1 8B | | Generation Quality Score | 66.5 | 11 | 3mo ago |
| Combined Suite (HS, PIQA, SIQA, Wino, MMLU, NQ, TQA, ARC-C, ARC-E, OBQA, BoolQ, DROP, BBH-LB, GSM8K) | MobileMoE-L | Accuracy | 57.8 | 4 | 7d ago |
| Overall Evaluation Suite | Qwen3-30B-A3B-Instruct-2507 | Average Score | 73.6 | 4 | 3mo ago |
| BIG-Bench (test) | Best Model | Accuracy | 83.6 | 2 | 22d ago |
| Linguistic Task | | Comprehensive Score | 88.9 | 2 | 1mo ago |