PhysBrain 1.0 Technical Report
About
Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves state-of-the-art results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Spatial Success Rate | 99.6 | 116 |
| Robot Manipulation | SimplerEnv WidowX | Success Rate: Put Spoon on Towel | 95.8 | 98 |
| Robotic Manipulation | RoboCasa GR1 Tabletop Manipulation (test) | PnP Bottle To Cabinet Close | 76 | 12 |
| Robot Manipulation | SimplerEnv GoogleRobot (out-of-domain) | Success Rate (Pick Coke Can) | 100 | 6 |