Open-ended Question Answering on PhysReason v2 (test)
[Chart: Subpart-AND (v2) by date; highlighted point: GPT-4o, 51.1, May 13, 2026]
Updated 19d ago
Evaluation Results
Method                               Settings                    Date     Subpart-AND (v2)
GPT-4o                               -                           2026.05  51.1
Claude Sonnet 4.5                    max_tokens=16384            2026.05  49.1
Physics-R1 (binary, 3-seed mean ±σ)  max_tokens=16384, rewa...   2026.05  39.6
Gemini 2.5 Pro                       -                           2026.05  38.8
Physics-R1 (binary, seed 42)         max_tokens=16384, rewa...   2026.05  32.2
Qwen3-VL-32B-Thinking                -                           2026.05  25.1
Qwen3-VL-8B-Thinking (base)          -                           2026.05  23.9
Physics-R1 (dense)                   max_tokens=16384, rewa...   2026.05  23.3
InternVL3-8B                         -                           2026.05  13.3