Outcome Reasoning on HumanEval Exe

75.7 (F1 Mean)

GPT-5

Updated 5mo ago

Evaluation Results

Method	Links	F1 Mean
GPT-5 2025.05		75.7	71.5
GPT-o4 2025.05		73.4	66.5
Llama4-M 2025.05		63.6	56.9
DeepSeek 2025.05		59.4	52.7
Qwen3 2025.05		58.2	51.5
Gemini2.5 2025.05		55.8	49.4
Llama4-S 2025.05		47.7	41.2