Chatbot Evaluation on ArenaHard v2
[Chart: ArenaHard v2 Score over time. Highlighted point: DeepSeek R1, 57.4 (Sep 25, 2025). Metric tabs: ArenaHard v2 Score, Hard Prompt Accuracy, Creative Writing Accuracy.]
Updated 15d ago
Evaluation Results
| Method | Date | ArenaHard v2 Score | Hard Prompt Accuracy | Creative Writing Accuracy |
|---|---|---|---|---|
| DeepSeek R1 | 2025.09 | 57.4 | - | - |
| Qwen3-32B + RLBFF training | 2025.09 | 55.6 | - | - |
| Claude-3.7-Sonnet (Thinking) | 2025.09 | 54.2 | - | - |
| o3-mini | 2025.09 | 50 | - | - |
| Qwen3-32B + Baseline BT training | 2025.09 | 47.5 | - | - |
| Qwen3-32B | 2025.09 | 44 | - | - |
| Base | 2026.01 | - | 14 | 13.7 |
| SFT on self-teacher | 2026.01 | - | 11.2 | 8.9 |
| GRPO | 2026.01 | - | 12 | 10.8 |
| SDPO | 2026.01 | - | 12.3 | 11.1 |