
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

About

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, where the goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art (SoTA) cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.

Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| IMU-to-Motion | LINGO 3pt (test) | MPJPE 39.86 | 5 |
| IMU-to-Motion | DIP-IMU | MPJPE 15.06 | 4 |
| IMU-to-Motion | IMUPoser | MPJPE 49.89 | 4 |
| IMU-to-Motion | LINGO 5pt sensors | MPJPE 30.49 | 4 |
| 3D Scene Generation | HUMOTO full length of IMU | 3D IoU 47.78 | 3 |
| IMU-to-Motion | LINGO 37 (test) | MPJPE 24.61 | 3 |
| IMU-to-MotionText | LINGO | MPJPE 30.49 | 3 |
| IMU-to-MotionText | HumanML | MPJPE 17.05 | 3 |
| IMU-to-MotionText | HUMOTO | MPJPE 26.04 | 3 |
| IMU-to-MotionText | ParaHome | MPJPE 27.56 | 3 |
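
The two metrics in the benchmark results can be illustrated with a minimal sketch. MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions, and 3D IoU is the ratio of intersection volume to union volume of two 3D regions. The function names, the axis-aligned box format, and the absence of any root alignment or unit conversion are assumptions made for illustration; the page does not state the units or the exact evaluation protocol behind the reported numbers.

```python
import math
from typing import Sequence

Point = Sequence[float]  # (x, y, z)
Box = Sequence[float]    # assumed format: (xmin, ymin, zmin, xmax, ymax, zmax)


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching predicted and true joints."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def box_iou_3d(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes (an assumed simplification)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two joints, one of them off by one unit along z.
    print(mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 0)]))
    # Two 2x2x2 boxes shifted by one unit along x.
    print(box_iou_3d((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2)))
```

Lower MPJPE is better, while higher 3D IoU is better, so the two columns of results are not directly comparable to each other.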