MTBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| LLM-as-a-Judge | MTbench (test) | StdDev 2.24 | 45 |
| General Capability | MTBench | MTBench Score 9.14 | 43 |
| Multi-turn Dialogue | MTBench101 | Score 9.03 | 33 |
| Helpfulness Evaluation | MTBench | Helpfulness 9.35 | 18 |
| Multi-modal Instruction Following | MM MTBench | Overall Score 84.9 | 18 |
| Temperature Forecasting | MTBench Temperature Forecasting (14-day) | MSE 5.026 | 17 |
| Stock Price Forecasting | MTBench Stock Price Forecasting (30-day) | MAE 1.122 | 17 |
| Stock Price Forecasting | MTBench Stock Price Forecasting 7-day | MAE 0.788 | 17 |
| Pair-wise comparison | MTBench Human | Accuracy 88.9 | 16 |
| Math Tutoring | MTBench MathTutorBench OOD | Score (Sc) 8.29 | 13 |
| Trend Prediction | MTBench Weather (Long) | Past Accuracy 93.496 | 10 |
| Trend Prediction | MTBench Weather Short | Past Trend Prediction Score 93.877 | 10 |
| Trend Prediction | MTBench Finance Long | 3-way Accuracy 62.671 | 10 |
| Trend Prediction | MTBench Finance Short | 3-way Score 66.849 | 10 |
| Time Series Forecasting | MTBench Weather Long | MSE 11.823 | 10 |
| Time Series Forecasting | MTBench Weather Short | MSE 10.02 | 10 |
| Time Series Forecasting | MTBench Finance (Long) | MAPE 3.531 | 10 |
| Time Series Forecasting | MTBench Finance Short | MAPE 2.545 | 10 |
| Weather Indicator Prediction | MTBench Weather Indicator | Max MSE 10.898 | 9 |
| Weather Forecasting | MTBench 7-day weather forecast | MSE 10.719 | 9 |
| Question Answering | MTBench Weather | Accuracy 71.7 | 9 |
| Question Answering | MTBench Finance | Accuracy 91.3 | 9 |
| Regression | MTBench Weather | MAE 3.523 | 9 |
| Regression | MTBench Finance | MAE 0.814 | 9 |
| Classification | MTBench Weather | Accuracy 55.7 | 9 |
Showing 25 of 36 rows