| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Multi-task Evaluation Suite | SPACE | Average Performance | 49 | 21 | 3mo ago |
| Aggregate HotPotQA, LiveBench-Math, Formula | RPT | Aggregate Score | 74.3 | 16 | 12d ago |
| Mit-Movie, TweetNER7, New York Times, CoNLL04, FindVehicle, and FabNER | SFT_Qwen | Precision | 85.67 | 13 | 2mo ago |
| Average All Benchmarks | | Accuracy | 71 | 9 | 19d ago |
| Overall Across All Benchmarks | AdaMMS | SUM | 563.56 | 8 | 3mo ago |
| BoolQ, ARC-C, ARC-E, HellaSwag Aggregate | | Average Accuracy | 70.1 | 5 | 3mo ago |
| MultiBLiMP, Belebele, ARCx, MMLUx, Exams Aggregate | Gemma 2 | AVG Borda | 2 | 4 | 2mo ago |
| All Average | Adam+Nexus | Accuracy | 40.3 | 3 | 1mo ago |