Multimodal Physical Reasoning on WMW-TRACEBANK synthetic controlled split external-transfer pool
[Chart: Answer Accuracy by date. Top entry: Claude Opus 4.7, 76 (May 28, 2026).]
Updated 5d ago
Evaluation Results
| Method | Date | Answer Accuracy | State Accuracy | Transition Accuracy | Trace-Answer Accuracy | HIR Accuracy | Revision Improvement (pp) | Rerank Improvement (pp) |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 2026.05 | 76 | 81 | 68 | 91 | 18 | 3 | 5 |
| GPT-5.5 | 2026.05 | 72 | 77 | 61 | 88 | 24 | 4 | 6 |
| GPT-4o | 2026.05 | 63 | 68 | 49 | 83 | 31 | 4 | 7 |
| Qwen2.5-VL-72B | 2026.05 | 60 | 65 | 47 | 82 | 33 | 3 | 6 |
| InternVL3-78B | 2026.05 | 58 | 63 | 44 | 80 | 34 | 3 | 5 |
| GPT-4o-mini | 2026.05 | 52 | 55 | 38 | 78 | 35 | 2 | 4 |
| Qwen2.5-VL-7B | 2026.05 | 42 | 46 | 30 | 72 | 42 | 2 | 3 |