Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

About

Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues -- velocity and acceleration -- which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing over-smoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
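
As a rough illustration of how predicted velocities and accelerations can act as soft constraints on a trajectory, the sketch below refines one joint's world-space path with linear least squares. It is not the HTD-Refine implementation. The function names (`diff_ops`, `refine_trajectory`), the weights, the finite-difference operators, and the quadratic objective are all assumptions made for illustration, and the "predicted" dynamics here are synthetic stand-ins for what a network such as PVA-Net would supply.

```python
import numpy as np


def diff_ops(T, dt):
    """Build finite-difference operators for a length-T trajectory.

    D1 maps T positions to T-1 velocities, D2 maps them to T-2 accelerations.
    (Hypothetical helper; the discretization is an assumption.)
    """
    D1 = np.zeros((T - 1, T))
    i = np.arange(T - 1)
    D1[i, i] = -1.0 / dt
    D1[i, i + 1] = 1.0 / dt

    D2 = np.zeros((T - 2, T))
    j = np.arange(T - 2)
    D2[j, j] = 1.0 / dt**2
    D2[j, j + 1] = -2.0 / dt**2
    D2[j, j + 2] = 1.0 / dt**2
    return D1, D2


def refine_trajectory(x_init, v_pred, a_pred, dt, w_pos=1.0, w_vel=1.0, w_acc=1.0):
    """Minimize, over x of shape (T, 3):

        w_pos * ||x - x_init||^2
      + w_vel * ||D1 x - v_pred||^2
      + w_acc * ||D2 x - a_pred||^2

    All three terms are soft penalties; the weights are illustrative only.
    """
    T = x_init.shape[0]
    D1, D2 = diff_ops(T, dt)
    # Stack the three weighted residual blocks into one least-squares system.
    A = np.vstack([
        np.sqrt(w_pos) * np.eye(T),
        np.sqrt(w_vel) * D1,
        np.sqrt(w_acc) * D2,
    ])
    b = np.vstack([
        np.sqrt(w_pos) * x_init,
        np.sqrt(w_vel) * v_pred,
        np.sqrt(w_acc) * a_pred,
    ])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, dt = 60, 1.0 / 30.0
    t = np.arange(T)[:, None] * dt
    x_true = np.hstack([np.sin(t), np.cos(t), 0.5 * t])   # synthetic joint path
    x_noisy = x_true + 0.05 * rng.standard_normal(x_true.shape)  # jittery estimate

    # Stand-ins for network-predicted dynamics: exact differences of the true path.
    D1, D2 = diff_ops(T, dt)
    x_ref = refine_trajectory(x_noisy, D1 @ x_true, D2 @ x_true, dt)

    print("error before:", np.linalg.norm(x_noisy - x_true))
    print("error after: ", np.linalg.norm(x_ref - x_true))
```

Because every term is a penalty rather than a hard constraint, the refined path stays anchored to the initial estimate while its frame-to-frame differences are pulled toward the supplied dynamics. A real system would also have to weigh how reliable each predicted quantity is.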

Dingkun Wei, Zehong Shen, Yan Xia, Georgios Pavlakos, Yujun Shen, Xiaowei Zhou • 2026

Related benchmarks

Task	Dataset	Result
Global human motion estimation	EMDB 2	WA-MPJPE 69.2	18
3D Human Motion Reconstruction	RICH static cameras (test)	Jitter 3.6	4
Human Mesh Recovery	H36M (test)	Jitter 3.6	4
