Outcome Reasoning on Open-Critic

75.3M' (F1 Mean)

GPT-5

Updated 5mo ago

Evaluation Results

Method	Links	F1 Mean
GPT-5 2025.05		75.3	69.4
GPT-o4 2025.05		73.8	67.5
Llama4-M 2025.05		61.5	54.7
DeepSeek 2025.05		57.2	50.6
Qwen3 2025.05		56.0	49.4
Gemini2.5 2025.05		53.7	47.3
Llama4-S 2025.05		45.8	39.2