| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| LLM-as-a-Judge | MTbench (test) | StdDev 2.24 | 45 |
| General Capability | MTBench | MTBench Score 9.14 | 43 |
| Multi-turn Dialogue | MTBench101 | Score 9.03 | 33 |
| Pair-wise comparison | MTBench Human | Accuracy 88.9 | 16 |
| Trend Prediction | MTBench Weather (Long) | Past Accuracy 93.496 | 10 |
| Trend Prediction | MTBench Weather Short | Past Trend Prediction Score 93.877 | 10 |
| Trend Prediction | MTBench Finance Long | 3-way Accuracy 62.671 | 10 |
| Trend Prediction | MTBench Finance Short | 3-way Score 66.849 | 10 |
| Time Series Forecasting | MTBench Weather Long | MSE 11.823 | 10 |
| Time Series Forecasting | MTBench Weather Short | MSE 10.02 | 10 |
| Time Series Forecasting | MTBench Finance (Long) | MAPE 3.531 | 10 |
| Time Series Forecasting | MTBench Finance Short | MAPE 2.545 | 10 |
| Question Answering | MTBench Weather | Accuracy 71.7 | 9 |
| Question Answering | MTBench Finance | Accuracy 91.3 | 9 |
| Regression | MTBench Weather | MAE 3.523 | 9 |
| Regression | MTBench Finance | MAE 0.814 | 9 |
| Classification | MTBench Weather | Accuracy 55.7 | 9 |
| Classification | MTBench Finance | Accuracy 54.3 | 9 |
| Helpfulness Evaluation | MTBench | Helpfulness 9.35 | 8 |
| Temperature Forecasting | MTBench Temperature Forecasting (14-day) | MSE 5.026 | 8 |
| Temperature Forecasting | MTBench Temperature Forecasting 7-day | MSE 4.021 | 8 |
| Stock Indicator Forecasting | MTBench Stock Indicator Forecasting (30-day) | MACD Score 3.342 | 8 |
| Stock Indicator Forecasting | MTBench Stock Indicator Forecasting (7-day) | MACD 2.047 | 8 |
| Stock Price Forecasting | MTBench Stock Price Forecasting (30-day) | MAE 1.122 | 8 |
| Stock Price Forecasting | MTBench Stock Price Forecasting 7-day | MAE 0.788 | 8 |