Multiple-Choice Question Answering on PhyX 3k (test)
Top result: Qwen3-VL-32B-Thinking, 84.2 Exact Match Accuracy
[Chart: Exact Match Accuracy by date; May 13, 2026]
Updated 19d ago
Evaluation Results
Method                               Details                    Date     Exact Match Accuracy
Qwen3-VL-32B-Thinking                -                          2026.05  84.2
Claude Sonnet 4.5                    max_tokens=16384           2026.05  80.6
Physics-R1 (dense)                   max_tokens=16384, rewa...  2026.05  77.5
Physics-R1 (binary, seed 42)         max_tokens=16384, rewa...  2026.05  76.9
Physics-R1 (binary, 3-seed mean ±σ)  max_tokens=16384, rewa...  2026.05  76.9
Qwen3-VL-8B-Thinking (base)          -                          2026.05  74.4
GPT-4o                               source=Shen et al. [2025]  2026.05  53.6
Gemini 2.5 Pro                       measured_by=authors        2026.05  49.8
InternVL3-8B                         -                          2026.05  43.1
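
The page reports Exact Match Accuracy and, for one row, a 3-seed mean ±σ, but does not say how either is computed. The following is a minimal sketch of one plausible reading, not the benchmark's documented scoring. The function names, the answer normalization (strip whitespace, uppercase), and the use of the sample standard deviation are all assumptions, and the data is a toy example rather than anything taken from the leaderboard.

```python
from statistics import mean, stdev


def exact_match_accuracy(predictions, references):
    """Percentage of predicted option letters that equal the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty set")
    # Assumption: answers are compared after stripping whitespace and uppercasing.
    hits = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


def mean_and_sigma(scores):
    """Mean and sample standard deviation of per-seed scores (assumed convention)."""
    return mean(scores), stdev(scores)


# Toy data for illustration only.
references = ["A", "C", "B", "D"]
predictions = ["a", "C", "D", "D "]
print(exact_match_accuracy(predictions, references))  # 75.0
```

If the population standard deviation is intended instead, `statistics.pstdev` would replace `stdev`; with only three seeds the two give noticeably different values.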