AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion

About

Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
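As a rough illustration of what "multi-view 2D motion data" can look like, the sketch below projects a 3D joint sequence into several virtual cameras using a standard pinhole model and computes a mean 2D joint distance. This is not the authors' pipeline: the camera placement, the focal length, the helper names (`look_at_camera`, `project_multiview`, `mean_joint_2d_error`), and the exact definition of the error are all assumptions made for illustration.

```python
import numpy as np

def look_at_camera(azimuth_rad, radius=4.0, height=1.0):
    """Hypothetical helper: world-to-camera (R, t) for a camera placed on a
    circle around the world origin and looking at the origin (y is up)."""
    cam_pos = np.array([radius * np.sin(azimuth_rad), height,
                        radius * np.cos(azimuth_rad)])
    forward = -cam_pos / np.linalg.norm(cam_pos)      # camera +z axis
    right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
    right /= np.linalg.norm(right)                    # camera +x axis
    up = np.cross(forward, right)                     # camera +y axis
    R = np.stack([right, up, forward])                # rows: camera axes
    t = -R @ cam_pos
    return R, t

def project_multiview(joints3d, n_views=4, focal=1000.0, center=(512.0, 512.0)):
    """Project joints of shape (T, J, 3) into n_views evenly spaced cameras.
    Returns 2D keypoints of shape (n_views, T, J, 2). All camera parameters
    here are assumed values, not taken from the paper."""
    views = []
    for k in range(n_views):
        R, t = look_at_camera(2.0 * np.pi * k / n_views)
        cam = joints3d @ R.T + t                      # world -> camera coords
        uv = focal * cam[..., :2] / cam[..., 2:3] + np.asarray(center)
        views.append(uv)
    return np.stack(views)

def mean_joint_2d_error(pred2d, gt2d):
    """Mean Euclidean distance between predicted and reference 2D joints
    (an assumed definition of a 2D joint error, in pixels)."""
    return float(np.linalg.norm(pred2d - gt2d, axis=-1).mean())

# Usage: a toy sequence of 2 frames and 3 joints, all at the world origin.
joints = np.zeros((2, 3, 3))
kp2d = project_multiview(joints)
```

Because every virtual camera in this sketch looks at the world origin, a joint placed at the origin lands on the principal point in each view, which gives a quick sanity check on the projection.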

Hongjie Li, Heng Yu, Jiaman Li, Hong-Xing Yu, Ehsan Adeli, C. Karen Liu, Jiajun Wu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Human Motion Reconstruction | AIST++ v1.1 (test) | J2D Error | 16.6 | 8 |
| Human Motion Reconstruction | Collected Internet videos (Gymnastics) | J2D Error | 21.6 | 5 |
| Human Motion Reconstruction | Collected Internet videos (Martial Arts) | J2D Error | 15.1 | 5 |
| Human Study | Collected Internet Videos | Ground Contact | 84.2 | 5 |
| Human-Object Interaction Reconstruction | BEHAVE Box, static-camera setup | Root Translation Error (T_root) | 24.61 | 3 |
| Human-Object Interaction Reconstruction | BEHAVE Chair, static-camera setup | T_root | 22.48 | 3 |
| Human-Object Interaction Reconstruction | BEHAVE Table, static-camera setup | Root Translation Error (mm) | 26.05 | 3 |
| Human-Object Interaction Reconstruction | BEHAVE Box, dynamic-camera setup | T_root Error | 29.99 | 2 |
| Human-Object Interaction Reconstruction | BEHAVE Chair, dynamic-camera setup | T_root Error | 23.87 | 2 |
| Human-Object Interaction Reconstruction | BEHAVE Table, dynamic-camera setup | Root Translation Error (T_root) | 28.09 | 2 |