Open-ended Question Answering on PHYSOLYM-A v1 (held-out)
[Chart: Problem-level Score over time. Highlighted point: Claude Sonnet 4.5, 33.4, May 13, 2026]
Updated 19d ago
Evaluation Results
| Method | Configuration | Date | Problem-level Score |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 | max_tokens=16384 | 2026.05 | 33.4 |
| Physics-R1 (binary, 3-seed mean ±σ) | max_tokens=16384, rewa... | 2026.05 | 26.3 |
| Physics-R1 (binary, seed 42) | max_tokens=16384, rewa... | 2026.05 | 25.6 |
| GPT-4o | | 2026.05 | 19.5 |
| Physics-R1 (dense) | max_tokens=16384, rewa... | 2026.05 | 19.2 |
| Qwen3-VL-32B-Thinking | | 2026.05 | 13.2 |
| Gemini 2.5 Pro | | 2026.05 | 12.2 |
| Qwen3-VL-8B-Thinking (base) | | 2026.05 | 8 |
| InternVL3-8B | | 2026.05 | 4 |