Chatbot Evaluation on ArenaHard v2

ArenaHard v2 Score: 57.4

DeepSeek R1

[Chart: ArenaHard v2 Score, Sep 25, 2025]
Updated 15d ago

Evaluation Results

Method  Links
2025.09   57.4   -      -
2025.09   55.6   -      -
2025.09   54.2   -      -
2025.09   50     -      -
2025.09   47.5   -      -
2025.09   44     -      -
2026.01   -      14     13.7
          -      11.2   8.9
2026.01   -      12     10.8
2026.01   -      12.3   11.1