Chatbot Evaluation on MT-Bench (GPT-4-Turbo Score)
[Chart: Score (GPT-4-Turbo) over time. Top entry: Qwen3-32B + RLBFF training, 9.5 (Sep 25, 2025).]
Updated 15d ago
Evaluation Results
Method                              Date      Score (GPT-4-Turbo)
Qwen3-32B + RLBFF training          2025.09   9.5
DeepSeek R1                         2025.09   9.49
Qwen3-32B + Baseline BT training    2025.09   9.45
Qwen3-32B                           2025.09   9.38
o3-mini                             2025.09   9.26
Claude-3.7-Sonnet (Thinking)        2025.09   8.93