| Dataset Name | SOTA Method | Metric | Trend | Last Updated |
|---|---|---|---|---|
| ARMMAN | OW-L | Accuracy: 85.78 | 9 | 14d ago |
| MMLU | | Accuracy: 91.02 | 9 | 14d ago |
| Ultrafeedback | OW-L | Accuracy: 73.66 | 9 | 14d ago |
| GPQA | CoTAgent | Calls: 198 | 9 | 3mo ago |
| AIME 24 | CoTAgent | Calls: 30 | 9 | 3mo ago |
| Reasoning Benchmarks Cooperative AutoGen framework (test) | MARSHAL (Generalist, 8B) | Overall Accuracy: 83.58 | 2 | 3mo ago |
| Reasoning Benchmarks Competitive MAD framework (test) | MARSHAL (Generalist, 8B) | Average Score: 0.8509 | 2 | 3mo ago |