
Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

About

Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction-image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio-visual consistency filtering to generate high-fidelity video-instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
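
The abstract names latent-space flow matching but does not give the probability path or the loss, so the following is only a minimal sketch of the generic technique, not the authors' implementation. It assumes a straight-line (rectified-flow style) path between a noise sample and a data latent, uses short toy vectors in place of image latents, and omits text conditioning. All function names (`interpolate`, `target_velocity`, `flow_matching_loss`) are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data latent x1 (t=1).

    Assumption: linear path; the abstract does not state which path is used.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, which is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Single-sample mean squared error between predicted and target velocity.

    predict_velocity is any callable (x_t, t) -> list of floats; in a real
    system it would be a neural network, here it is a stand-in.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_target = target_velocity(x0, x1)
    v_pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [0.5, -1.0, 2.0]  # toy "data latent", purely illustrative

    def zero_model(x_t, t):
        # Untrained stand-in that always predicts zero velocity.
        return [0.0 for _ in x_t]

    print(flow_matching_loss(zero_model, latent, rng))
```

In training, this loss would be averaged over many latents and time samples and minimized with respect to the model parameters; at sampling time the learned velocity field is integrated from noise toward a latent, which a separate decoder turns into an image.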

Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-Image Generation | GenEval | GenEval Score | 63 | 277
Video Understanding | VideoMME | Overall Score | 63.8 | 192
Text-to-Image Generation | DPG-Bench | DPG Score | 76.49 | 89
Audio Understanding | MMAR | MMAR | 56.8 | 12
Physical Perception | PAI-Bench | PAI-Bench Score | 57.7 | 9
Physical Perception | QuantiPhy | QuantiPhy Score | 38.5 | 9
Physical Perception | FysicsEval | Prediction Score | 32.6 | 9
Physical Perception | PhysBench | PhysBench Score | 47.2 | 9
Physical Perception | PhysUniBench | PhysUniBench Score | 50.8 | 9
Omni-modal Understanding | OmniBench | Overall Score | 47.27 | 8
Showing 10 of 13 rows
