Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

About

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, where the goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.

Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang • 2026

Related benchmarks

Task	Dataset	Result
IMU-to-Motion	LINGO 3pt (test)	MPJPE 39.86	5
IMU-to-Motion	DIP-IMU	MPJPE 15.06	4
IMU-to-Motion	IMUPoser	MPJPE 49.89	4
IMU-to-Motion	LINGO 5pt sensors	MPJPE 30.49	4
3D Scene Generation	HUMOTO full length of IMU	3D IoU 47.78	3
IMU-to-Motion	LINGO 37 (test)	MPJPE 24.61	3
IMU-to-MotionText	LINGO	MPJPE 30.49	3
IMU-to-MotionText	HumanML	MPJPE 17.05	3
IMU-to-MotionText	HUMOTO	MPJPE 26.04	3
IMU-to-MotionText	ParaHome	MPJPE 27.56	3
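
The table above reports MPJPE and 3D IoU without defining them. As a rough illustration only, MPJPE (mean per-joint position error) is commonly computed as the average Euclidean distance between predicted and ground-truth joint positions, and 3D IoU as intersection volume over union volume. The sketch below assumes joints are given as (x, y, z) tuples grouped per frame and that boxes are axis-aligned; the function names `mpjpe` and `iou_3d` are hypothetical, and the paper's actual evaluation protocol (alignment, units, box or voxel representation) is not stated in this listing and may differ.

```python
import math


def mpjpe(pred, target):
    """Average Euclidean distance over every joint in every frame.

    pred and target are lists of frames; each frame is a list of (x, y, z)
    joint positions. This layout is an assumption made for illustration.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for frame_pred, frame_target in zip(pred, target):
        if len(frame_pred) != len(frame_target):
            raise ValueError("each frame must have the same number of joints")
        for joint_pred, joint_target in zip(frame_pred, frame_target):
            total += math.dist(joint_pred, joint_target)
            count += 1
    if count == 0:
        raise ValueError("no joints to compare")
    return total / count


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes.

    Each box is (min_corner, max_corner), both (x, y, z) tuples.
    Axis alignment is an assumption; oriented boxes need a different method.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    intersection = 1.0
    for axis in range(3):
        overlap = min(a_max[axis], b_max[axis]) - max(a_min[axis], b_min[axis])
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        intersection *= overlap
    volume_a = math.prod(a_max[i] - a_min[i] for i in range(3))
    volume_b = math.prod(b_max[i] - b_min[i] for i in range(3))
    return intersection / (volume_a + volume_b - intersection)
```

Lower MPJPE means predicted joints lie closer to the ground truth, while higher 3D IoU means the predicted scene geometry overlaps the reference more closely.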

