| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | WeMath | Accuracy | 80.6 | 225 |
| Multimodal Math Reasoning | WeMath | Accuracy | 98.7 | 211 |
| Multimodal Reasoning | WeMath | Accuracy | 72.2 | 171 |
| Visual Mathematical Reasoning | WeMath | Accuracy | 98.7 | 149 |
| Image Reasoning | WeMath | Accuracy | 71.3 | 34 |
| Mathematical multi-modal reasoning | WeMath | Pass@1 | 85.11 | 30 |
| Mathematical Reasoning | WeMath 525 samples | Accuracy | 78.5 | 24 |
| Visual Mathematical Reasoning | WeMath strict | Accuracy | 44.8 | 18 |
| Multimodal Mathematical Reasoning | WeMath mini (test) | Accuracy | 79.5 | 18 |
| Step-wise Verification | WeMath | Macro F1 | 63.9 | 18 |
| Multimodal Mathematical Reasoning | WeMath (test) | Accuracy | 72.15 | 17 |
| Visual Reasoning | WeMath strict | Score | 39 | 12 |
| Visual Mathematical Reasoning | WeMath Loose | Score | 79 | 10 |
| Multimodal Mathematical Reasoning | WeMath | WeMath-S Score | 36.33 | 8 |
| Multimodal Scientific Reasoning | WeMath | Accuracy | 71.77 | 8 |
| Mathematical reasoning | WeMath-L | Score | 82.19 | 6 |
| Mathematical reasoning | WeMath-S | Score | 68.86 | 6 |
| Multidisciplinary Reasoning | WeMath | Accuracy | 61.6 | 6 |
| Mathematical Reasoning | WeMath loose | Accuracy | 52.1 | 6 |
| First Incorrect Step Identification | WeMath | FISI F1 Score | 24.9 | 6 |
| Multimodal Mathematical Reasoning | WeMath 19 | Macro Average Score | 61.52 | 2 |