ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
About
3D human reaction generation faces three main challenges:(1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth ones, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Human-human interaction motion generation | InterHuman | FID2.433 | 23 | |
| text-conditioned human interaction generation | InterHuman (test) | R Precision (Top 1)44.1 | 12 | |
| text-conditioned human interaction generation | InterX (test) | R-Precision (Top 1)42 | 10 | |
| Motion Generation | InterX | FID0.058 | 6 |