TEMOS: Generating diverse human motions from textual descriptions
About
We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, and more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.
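To illustrate how a text encoder that outputs distribution parameters can yield several different motions for one description, here is a minimal sketch of the sampling step. It assumes a diagonal Gaussian latent parameterized by a mean and a log-variance, which is a common VAE choice and not something the abstract specifies. The names `sample_latents` and `kl_to_standard_normal`, and the example values of `mu` and `logvar`, are hypothetical; this is not the TEMOS code.

```python
import math
import random


def sample_latents(mu, logvar, num_samples, seed=None):
    """Draw latent vectors z = mu + sigma * eps, with eps ~ N(0, I).

    mu and logvar stand in for the output of a text encoder (assumed
    diagonal Gaussian). Each sample would be passed to a motion decoder.
    """
    rng = random.Random(seed)
    sigma = [math.exp(0.5 * lv) for lv in logvar]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(num_samples)
    ]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


# Hypothetical encoder output for a single text description.
mu = [0.2, -1.0, 0.5]
logvar = [-1.0, -0.5, -2.0]

# Four different latent codes for the same text, i.e. four candidate motions.
latents = sample_latents(mu, logvar, num_samples=4, seed=0)
```

Decoding each entry of `latents` with the same motion decoder would give a different motion for the same text, whereas a deterministic model corresponds to always decoding `mu`. The KL helper shows the kind of regularizer typically used in VAE training to keep the encoder's distribution close to a standard normal prior; whether TEMOS uses exactly this term is not stated in the abstract.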
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-motion generation | HumanML3D (test) | FID | 3.734 | 331 |
| Text-to-motion mapping | KIT-ML (test) | R Precision (Top 3) | 67 | 275 |
| Text-to-motion mapping | HumanML3D (test) | FID | 3.734 | 243 |
| Motion-to-text retrieval | KIT-ML (test) | R@1 | 41.88 | 41 |
| Motion-to-text retrieval | HumanML3D (test) | R@1 | 39.96 | 27 |
| Text-to-motion retrieval | HumanML3D (test) | R@1 | 40.49 | 27 |
| Interactive Motion Synthesis | InterHuman (test) | R Precision (Top 1) | 22.4 | 25 |
| Text-to-motion retrieval | HumanML3D 1.0 (test) | R@1 | 40.49 | 24 |
| Motion-to-text retrieval | HumanML3D 1.0 (test) | R@1 | 39.96 | 24 |
| Human-human interaction motion generation | InterHuman | FID | 17.375 | 23 |