PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

About

Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
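The noise-free condition injection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`build_token_timesteps`, `inject_clean_conditions`), the grid sizes, and the additive Gaussian noise are all assumptions chosen for illustration. The sketch only shows the bookkeeping: a (time × joints) grid of latent tokens, a per-token timestep that is 0 on conditioning frames, and clean latents substituted wherever the timestep is 0.

```python
import numpy as np

def build_token_timesteps(num_frames, num_joints, cond_frames, t):
    """Per-token timestep grid: 0 on conditioning frames, t everywhere else.

    Hypothetical helper; `t` stands for the current diffusion timestep.
    """
    timesteps = np.full((num_frames, num_joints), t, dtype=np.int64)
    idx = np.asarray(list(cond_frames), dtype=np.int64)
    timesteps[idx, :] = 0
    return timesteps

def inject_clean_conditions(noisy_latents, clean_latents, timesteps):
    """Keep clean latents where the token timestep is 0, noisy ones elsewhere."""
    mask = (timesteps == 0)[..., None]  # shape (T, J, 1), broadcast over channels
    return np.where(mask, clean_latents, noisy_latents)

# Illustrative sizes (assumed): 8 latent frames, 4 joints, 16 channels per token.
rng = np.random.default_rng(0)
T, J, D = 8, 4, 16
clean = rng.standard_normal((T, J, D))
noisy = clean + rng.standard_normal((T, J, D))  # stand-in for a noised sample

# Condition on the first two frames; the rest are denoised at timestep 500.
ts = build_token_timesteps(T, J, cond_frames=[0, 1], t=500)
mixed = inject_clean_conditions(noisy, clean, ts)
```

Under the same assumptions, segment chaining would reuse this routine: the final frames of one generated segment are passed as the clean conditioning frames of the next.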

Zeyu Ling, Qing Shuai, Teng Zhang, Shiyang Li, Bo Han, Changqing Zou • 2026

Related benchmarks

Task	Dataset	Result
Motion Generation	MBench 16 (official leaderboard)	Jitter Penalty 0.009	17
Text-to-motion	HumanML3D 10 (test)	R-Precision@1 69.9	12
Text-to-motion	MotionHub (test)	R-Precision (T1) 53	12
Pose-conditioned motion generation	HumanML3D	R-Precision (Top 3) 0.902	10
Pose-conditioned motion generation	MotionHub	R-P T3 0.775	10
Sequential action generation	BABEL	R@3 58.7	5
Narrative motion composition	50 diverse narrative prompts	Motion Quality (Good) 74.3	1
