MotionMixer: MLP-based 3D Human Body Pose Forecasting
About
In this work, we present MotionMixer, an efficient 3D human body pose forecasting model based solely on multi-layer perceptrons (MLPs). MotionMixer learns the spatial-temporal 3D body pose dependencies by sequentially mixing both modalities. Given a stacked sequence of 3D body poses, a spatial MLP first extracts fine-grained spatial dependencies among the body joints. The interaction of the body joints over time is then modelled by a temporal MLP. The spatial-temporal mixed features are finally aggregated and decoded to obtain the future motion. To calibrate the influence of each time step in the pose sequence, we make use of squeeze-and-excitation (SE) blocks. We evaluate our approach on the Human3.6M, AMASS, and 3DPW datasets using the standard evaluation protocols. Across all evaluations, we demonstrate state-of-the-art performance while using fewer parameters than comparable models. Our code is available at: https://github.com/MotionMLP/MotionMixer
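The mixing pipeline described above (spatial MLP over joints, temporal MLP over frames, SE re-weighting of time steps) can be sketched roughly in PyTorch. The layer sizes, residual connections, and SE reduction ratio below are illustrative assumptions, not the paper's exact configuration; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation over the temporal axis: learns a weight
    per time step to calibrate each frame's influence (assumed ratio)."""
    def __init__(self, num_frames, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_frames, num_frames // reduction),
            nn.ReLU(),
            nn.Linear(num_frames // reduction, num_frames),
            nn.Sigmoid(),
        )

    def forward(self, x):               # x: (batch, frames, joint_dims)
        w = self.fc(x.mean(dim=-1))     # squeeze joints -> (batch, frames)
        return x * w.unsqueeze(-1)      # excite: re-weight each time step


class MixerBlock(nn.Module):
    """One spatial-temporal mixing block: mix across joint dimensions,
    then across frames, then apply SE calibration."""
    def __init__(self, num_frames, joint_dims):
        super().__init__()
        self.spatial_mlp = nn.Sequential(
            nn.LayerNorm(joint_dims),
            nn.Linear(joint_dims, joint_dims),
            nn.GELU(),
        )
        self.temporal_mlp = nn.Sequential(
            nn.LayerNorm(num_frames),
            nn.Linear(num_frames, num_frames),
            nn.GELU(),
        )
        self.se = SEBlock(num_frames)

    def forward(self, x):               # x: (batch, frames, joint_dims)
        x = x + self.spatial_mlp(x)     # mix spatial joint dependencies
        # transpose so the temporal MLP mixes across frames
        x = x + self.temporal_mlp(x.transpose(1, 2)).transpose(1, 2)
        return self.se(x)               # calibrate per-frame influence
```

A decoder MLP (omitted here) would then map the mixed features of the observed sequence to the future poses.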
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Human Motion Prediction | Human3.6M (test) | -- | 85 |
| Human Motion Prediction | Human3.6M | -- | 46 |
| Human Motion Prediction | 3DPW | Trajectory Error (400ms): 22.8 | 27 |
| 3D Human Motion Prediction | Human3.6M S5 (test) | Average MPJPE (560ms): 46.1 | 17 |
| 3D Human Pose Prediction | Human3.6M | Avg 3D Error (160ms): 13.2 | 16 |
| 3D Pose Forecasting (Joint Angles) | Human3.6M | MAE @ 80ms: 0.2 | 15 |
| 3D Hand Pose Estimation | TED Hands (test) | L2 Error: 2.324 | 14 |
| 3D Hand Gesture Generation | B2H dataset (test) | FHD: 2.169 | 8 |
| 3D Hand Gesture Sampling | TED Hands dataset (test) | FHD: 1.613 | 8 |
| 3D Hand Prediction | B2H dataset (test) | L2 Error: 3.616 | 7 |