PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

About

Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space, made without modifying the generator, substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art results on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
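The noise-free condition injection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inject_clean_conditions`, the grid shape `(T, J, D)`, and the linear interpolation between data and noise are all assumptions chosen for illustration, and the paper's actual noise schedule and tokenization may differ. The sketch only shows the core idea that every token on the time × joints grid carries its own timestep, with conditioning frames pinned to timestep 0 so they stay clean.

```python
import numpy as np


def inject_clean_conditions(latents, cond_frames, t, rng):
    """Noise a (T, J, D) latent grid with a per-token timestep.

    latents:     clean latent grid, one D-dim token per (frame, joint).
    cond_frames: indices of frames to keep clean (hypothetical argument).
    t:           noise level in (0, 1] for all non-conditioning tokens.
    """
    T, J, D = latents.shape

    # Every token gets its own timestep; start with noise level t everywhere.
    timesteps = np.full((T, J), t, dtype=np.float64)

    # Conditioning frames are pinned to timestep 0, i.e. injected as clean tokens.
    timesteps[cond_frames, :] = 0.0

    # Assumed noising rule: x_t = (1 - t) * x_0 + t * eps.
    # The paper's schedule is not stated in the abstract.
    noise = rng.standard_normal(latents.shape)
    tt = timesteps[..., None]  # broadcast over the feature dimension
    noisy = (1.0 - tt) * latents + tt * noise
    return noisy, timesteps


# Example: 8 frames, 22 joints, 4-dim tokens; the first two frames act as conditions.
rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 22, 4))
noisy, timesteps = inject_clean_conditions(latents, [0, 1], 0.7, rng)
```

Under these assumptions, a denoiser that receives `noisy` together with `timesteps` can tell which tokens are given and which must be generated. Reusing the last frames of one generated segment as `cond_frames` for the next is one way to picture the autoregressive segment chaining the abstract mentions.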

Zeyu Ling, Qing Shuai, Teng Zhang, Shiyang Li, Bo Han, Changqing Zou • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Motion Generation | MBench 16 (official leaderboard) | Jitter Penalty | 0.009 | 17 |
| Text-to-motion | HumanML3D 10 (test) | R-Precision@1 | 69.9 | 12 |
| Text-to-motion | MotionHub (test) | R-Precision (T1) | 53 | 12 |
| Pose-conditioned motion generation | HumanML3D | R-Precision (Top 3) | 0.902 | 10 |
| Pose-conditioned motion generation | MotionHub | R-P T3 | 0.775 | 10 |
| Sequential action generation | BABEL | R@3 | 58.7 | 5 |
| Narrative motion composition | 50 diverse narrative prompts | Motion Quality (Good) | 74.3 | 1 |