Outcome Reasoning on MalAlgoQA

85.1 (F1 Mean)

GPT-5

Updated 5mo ago

Evaluation Results

Method	Links	F1 Mean	(unlabeled)
GPT-5 2025.05		85.1	79.6
GPT-o4 2025.05		83.6	77.8
Llama4-M 2025.05		74	68.2
DeepSeek 2025.05		71.8	65.1
Gemini2.5 2025.05		69.6	62.9
Qwen3 2025.05		67.5	60.9
Llama4-S 2025.05		57.4	50.7