Long-horizon task execution on LongAct Bench (detailed split)

59 GC (Avg)

GPT-5

Updated 2mo ago

Evaluation Results

Method	Links	GC (Avg)
GPT-5 2026.05		59	66.7	82.9	75.2	60	16	1,982	25.3	1.7
Qwen3-VL-32B 2026.05		51.2	52.7	71.2	73.4	25	15	1,692	28.7	1.61
GPT-5-mini 2026.05		38.4	40.5	58.9	66.7	25	9	2,304	21.5	0.81
Qwen3-VL-8B 2026.05		24.5	27.4	55	48.7	31.3	3	1,896	30.6	0.99
Qwen3-VL-2B 2026.05		7.16	7.04	8.77	37.8	0	0	844	4.81	0.59
Qwen3-VL-32B 2026.05		6.14	3.54	18.8	7.32	0	0	2,981	3.1	-0.32
Qwen3-VL-8B 2026.05		0.74	0.15	6.93	0	0	0	2,598	0.26	-0.08