Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
About
Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction-image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio-visual consistency filtering to generate high-fidelity video-instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
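To make the "latent-space flow matching" objective mentioned above more concrete, here is a minimal sketch of a single-sample flow matching loss. The abstract does not state which probability path, noise distribution, or network OmniFysics uses, so everything here is an assumption for illustration: a straight-line path between Gaussian noise and a data latent, and the hypothetical names `interpolate`, `target_velocity`, `flow_matching_loss`, and `predict_velocity`.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data latent x1 (t=1).

    Assumption: a linear path; the source does not specify the path used.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is a hypothetical stand-in for the generator network:
    it takes (x_t, t) and returns a velocity with the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise latent (assumed)
    t = rng.random()                        # time sampled uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [1.0, 2.0, 3.0]  # toy data latent, not taken from the paper

    def zero_model(x_t, t):
        # Untrained placeholder that always predicts zero velocity
        return [0.0] * len(x_t)

    print(flow_matching_loss(zero_model, latent, rng))
```

In a real training loop the loss would be averaged over a batch and minimized with respect to the network parameters; at sampling time the learned velocity field would be integrated from noise to a latent, which an image decoder then maps to pixels.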
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | GenEval Score | 63 | 277 |
| Video Understanding | VideoMME | Overall Score | 63.8 | 192 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 76.49 | 89 |
| Audio Understanding | MMAR | MMAR | 56.8 | 12 |
| Physical Perception | PAI-Bench | PAI-Bench Score | 57.7 | 9 |
| Physical Perception | QuantiPhy | QuantiPhy Score | 38.5 | 9 |
| Physical Perception | FysicsEval | Prediction Score | 32.6 | 9 |
| Physical Perception | PhysBench | PhysBench Score | 47.2 | 9 |
| Physical Perception | PhysUniBench | PhysUniBench Score | 50.8 | 9 |
| Omni-modal Understanding | OmniBench | Overall Score | 47.27 | 8 |