Multimodal Autoformalization on PHYX Electromagnetism
[Chart: Compile Success Rate. Highlighted result: GPT-5, 0.667 (Jan 6, 2026). Available metrics: Compile Success Rate, Semantic Correctness, Human Check Score.]
Updated 3d ago
Evaluation Results
| Method | Modality | Date | Compile Success Rate | Semantic Correctness | Human Check Score |
|---|---|---|---|---|---|
| GPT-5 | Image | 2026.01 | 0.667 | 0.5 | 0.5 |
| GPT-5 | Text | 2026.01 | 0.5 | 0 | - |
| Gemini-3-Pro | Image | 2026.01 | 0.167 | 0.333 | 0.333 |
| Gemini-3-Pro | Text | 2026.01 | 0.167 | 0 | - |
| Gemini-2.5-Pro | Image | 2026.01 | 0 | 0 | 0 |
| Gemini-2.5-Pro | Text | 2026.01 | 0 | 0 | - |
| Qwen3-VL-235B | Image | 2026.01 | 0 | 0 | 0 |
| Qwen3-VL-235B | Text | 2026.01 | 0 | 0 | - |
| Qwen2.5-VL-72B | Image | 2026.01 | 0 | 0 | 0 |
| Qwen2.5-VL-72B | Text | 2026.01 | 0 | 0 | - |
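The results above list each model under two modalities, so one natural reading is to compare Image against Text per model. The following is a minimal sketch of that comparison, assuming the Compile Success Rate values are transcribed by hand into a dictionary; the name `compile_success` and the helper `modality_gap` are hypothetical and not part of the benchmark.

```python
# Compile Success Rate values transcribed from the results listed above.
# Keys are (model, modality); this layout is illustrative only.
compile_success = {
    ("GPT-5", "Image"): 0.667,
    ("GPT-5", "Text"): 0.5,
    ("Gemini-3-Pro", "Image"): 0.167,
    ("Gemini-3-Pro", "Text"): 0.167,
    ("Gemini-2.5-Pro", "Image"): 0.0,
    ("Gemini-2.5-Pro", "Text"): 0.0,
    ("Qwen3-VL-235B", "Image"): 0.0,
    ("Qwen3-VL-235B", "Text"): 0.0,
    ("Qwen2.5-VL-72B", "Image"): 0.0,
    ("Qwen2.5-VL-72B", "Text"): 0.0,
}


def modality_gap(scores):
    """Return {model: image_score - text_score} for models that have both entries."""
    models = {model for model, _ in scores}
    return {
        m: round(scores[(m, "Image")] - scores[(m, "Text")], 3)
        for m in sorted(models)
        if (m, "Image") in scores and (m, "Text") in scores
    }


gaps = modality_gap(compile_success)
best = max(compile_success, key=compile_success.get)

print("Best entry:", best)
for model, gap in gaps.items():
    print(f"{model}: image minus text = {gap}")
```

On these numbers, only GPT-5 shows a difference in Compile Success Rate between the two modalities; the same helper can be reused for Semantic Correctness by swapping in that column.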