| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Prediction | internal benchmark | Accuracy: 82.2 | 14 |
| Mathematical Reasoning | Internal Benchmark | Average Score: 65.5 | 5 |
| Agent Action Safety Verification | internal benchmark 300-scenario | Verdict Accuracy: 95 | 5 |
| General Multimodal Intelligence Evaluation | Internal Benchmark (test) | Overall Score: 61.6 | 5 |