MAMMA: Markerless & Automatic Multi-Person Motion Action Capture
About
We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers; although these offer high accuracy, their reliance on specialized hardware, manual marker placement, and extensive post-processing makes them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that uses a learnable query for each landmark. We demonstrate that our approach handles complex person-person interactions and offers greater accuracy than existing methods. To train our network, we construct a large synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. The dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system that captures human motion without markers and offers reconstruction quality competitive with commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. Our dataset is available at https://mamma.is.tue.mpg.de for research purposes.
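The description above only outlines the landmark predictor at a high level. Below is a minimal, hypothetical PyTorch sketch of a decoder that assigns one learnable query per dense surface landmark and conditions prediction on a per-person segmentation mask. The backbone, landmark count, transformer configuration, and output parameterization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical learnable-query landmark decoder (illustrative sketch only).
import torch
import torch.nn as nn


class LandmarkQueryDecoder(nn.Module):
    def __init__(self, num_landmarks=512, d_model=256, num_layers=4, nhead=8):
        super().__init__()
        # One learnable query embedding per dense surface landmark.
        self.queries = nn.Parameter(torch.randn(num_landmarks, d_model))
        # Project the RGB image plus one extra channel for the person's segmentation mask.
        self.input_proj = nn.Conv2d(3 + 1, d_model, kernel_size=4, stride=4)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Each query regresses a normalized 2D location and a visibility logit.
        self.head = nn.Linear(d_model, 3)

    def forward(self, image, mask):
        # image: (B, 3, H, W); mask: (B, 1, H, W) segmentation of the target person.
        feats = self.input_proj(torch.cat([image, mask], dim=1))          # (B, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)                         # (B, h*w, C)
        queries = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        decoded = self.decoder(queries, tokens)                           # (B, L, C)
        out = self.head(decoded)                                          # (B, L, 3)
        landmarks_2d = out[..., :2].sigmoid()  # normalized (x, y) in [0, 1]
        visibility = out[..., 2]               # visibility logit
        return landmarks_2d, visibility


# Usage sketch: one forward pass per person, using that person's mask as conditioning.
model = LandmarkQueryDecoder()
img = torch.rand(1, 3, 256, 256)
person_mask = torch.rand(1, 1, 256, 256)
lms, vis = model(img, person_mask)  # lms: (1, 512, 2), vis: (1, 512)
```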
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Human Mesh Recovery | MoYo | MPJPE | 22.95 | 16 |
| 3D Human Pose Estimation | CHI3D | MPJPE | 37.96 | 15 |
| Human Mesh Recovery | RICH | -- | -- | 13 |
| Human Pose Estimation | Harmony4D | PVE | 34.02 | 9 |
| 3D Human Pose Estimation | Hi4D (test) | MPJPE | 12.44 | 8 |
| 3D Human Mesh Fitting | MammaEval-S | MPJPE | 12.96 | 5 |
| 3D Human Mesh Fitting | MammaEval-D | MPJPE | 17.71 | 5 |
| 3D Human Reconstruction | Harmony4D + CHI3D + MammaEval-D (test) | Mean Perceptual Depth (mm) | 8.46 | 5 |
| 2D Landmark Prediction | RICH | Mean 2D Euclidean Distance Error (pixels) | 8.55 | 4 |
| 2D Landmark Prediction | MoYo | Mean 2D Euclidean Distance Error (pixels) | 11.04 | 4 |
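For reference, the metrics in the table are the standard reconstruction errors used in this area. Below is a minimal NumPy sketch, assuming the usual conventions (3D errors in millimetres, 2D errors in pixels, root alignment for MPJPE); the exact evaluation protocol of each benchmark may differ.

```python
# Standard error metrics (illustrative sketch; not the official evaluation code).
import numpy as np


def mpjpe(pred_joints, gt_joints, root_index=0):
    """Mean per-joint position error; inputs are (J, 3) joint arrays in mm."""
    pred = pred_joints - pred_joints[root_index]  # root-align prediction (assumed protocol)
    gt = gt_joints - gt_joints[root_index]        # root-align ground truth
    return np.linalg.norm(pred - gt, axis=-1).mean()


def pve(pred_vertices, gt_vertices):
    """Per-vertex error; inputs are (V, 3) mesh vertex arrays in mm."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=-1).mean()


def mean_2d_error(pred_landmarks, gt_landmarks):
    """Mean 2D Euclidean distance; inputs are (L, 2) pixel coordinates."""
    return np.linalg.norm(pred_landmarks - gt_landmarks, axis=-1).mean()
```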