PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
About
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX, the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-ended Question Answering | PhysReason v2 (test) | Subpart-AND (v2) | 51.1 | 9 |
| Open-ended Question Answering | PHYSOLYM-A v1 (held-out) | Problem-level Score | 19.5 | 9 |
| Multiple-choice Question Answering | PhyX 3k (test) | Exact Match Accuracy | 53.6 | 9 |
| Open-ended Question Answering | PUB-OE v3 (test) | Subpart AND (v3) | 31 | 9 |
| Multiple-choice Question Answering | PhyX 1k (test) | MCQ Exact Accuracy | 70.4 | 9 |
| Open-ended Question Answering | OlymBench Phys v1 (test) | Problem Level Score | 19.7 | 9 |
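
The multiple-choice rows above report exact-match accuracy. As a minimal sketch of how such a score could be computed, assuming each prediction and each gold answer is a single option letter, the Python below counts case-insensitive exact matches and returns a percentage. The function name `exact_match_accuracy` and the toy data are hypothetical illustrations; this is not the scoring code used by PhyX or VLMEvalKit.

```python
def exact_match_accuracy(predictions, answers):
    """Return the percentage of predictions that exactly match the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Normalize case and surrounding whitespace so "c " and "C" count as the same option.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data, not drawn from PhyX.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(f"{exact_match_accuracy(preds, gold):.1f}")
```

Open-ended answers would need a different comparison (for example, numeric tolerance or a judge model), which this sketch does not attempt.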