| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| Multi-task Evaluation Suite | SPACE | Average Performance | 49 | 21 | 1mo ago |
| Mit-Movie, TweetNER7, New York Times, CoNLL04, FindVehicle, and FabNER | SFT_Qwen | Precision | 85.67 | 13 | 15d ago |
| Overall Across All Benchmarks | AdaMMS | SUM | 563.56 | 8 | 1mo ago |
| BoolQ, ARC-C, ARC-E, HellaSwag Aggregate | | Average Accuracy | 70.1 | 5 | 1mo ago |
| MultiBLiMP, Belebele, ARCx, MMLUx, Exams Aggregate | Gemma 2 | AVG Borda | 2 | 4 | 1mo ago |
| All Average | Adam+Nexus | Accuracy | 40.3 | 3 | 5d ago |