Open-ended Question Answering on PUB-OE v3 (test)
[Chart: Subpart AND (v3) over time. Labeled point: Physics-R1 (dense), 37.7 (May 13, 2026).]
Updated 19d ago
Evaluation Results
Method                               Settings                   Date     Subpart AND (v3)
Physics-R1 (dense)                   max_tokens=16384, rewa...  2026.05  37.7
Physics-R1 (binary, seed 42)         max_tokens=16384, rewa...  2026.05  37
Qwen3-VL-8B-Thinking (base)                                     2026.05  35.3
Physics-R1 (binary, 3-seed mean ±σ)  max_tokens=16384, rewa...  2026.05  34.8
Gemini 2.5 Pro                                                  2026.05  33.4
Qwen3-VL-32B-Thinking                                           2026.05  32.8
GPT-4o                                                          2026.05  31
Claude Sonnet 4.5                    max_tokens=16384           2026.05  25.4
InternVL3-8B                                                    2026.05  23.5
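
One entry above is reported as a "3-seed mean ±σ". As a minimal sketch of how such a summary can be computed, the snippet below averages per-seed scores and reports their spread. The function name `summarize_seeds` is hypothetical, the input values are toy numbers rather than leaderboard results, and the use of the sample standard deviation is an assumption, since the page does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed scores.

    Assumption: sigma is the sample standard deviation (n - 1 denominator).
    The leaderboard does not state which estimator it uses.
    """
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    return mean(scores), stdev(scores)


# Toy values chosen only to make the arithmetic easy to check;
# they are NOT results from this leaderboard.
toy_scores = [1.0, 2.0, 3.0]
m, s = summarize_seeds(toy_scores)
print(f"{m:.1f} ± {s:.1f}")
```

A single-seed entry, such as the "seed 42" row, has no spread to report, which is why the sketch rejects inputs with fewer than two scores.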