UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors
About
Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive datasets of 3D human motion paired with text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs (i.e., models trained only for text-to-motion synthesis) in more diverse cross-modal and in-context downstream motion generation tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock these priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks as compositions of atomic per-frame operations, enabling in-context adaptation that unlocks the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent, and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead over the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support a range of previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: https://oliver-cong02.github.io/UMO.github.io/
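The conditioning mechanism described above (three learnable frame-level meta-operation embeddings plus lightweight temporal fusion) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the operation names, the choice of a 1-D convolution for temporal fusion, and the residual injection are all assumptions.

```python
import torch
import torch.nn as nn


class MetaOpConditioner(nn.Module):
    """Sketch of per-frame meta-operation conditioning in a UMO-like setup.

    Three learnable meta-operation embeddings tag each frame with an intent
    (op names/IDs are hypothetical), and a lightweight temporal fusion layer
    injects the resulting in-context cues into the frame features of a
    pretrained DiT-style backbone as a cheap residual.
    """

    NUM_META_OPS = 3  # per the abstract: three meta-operation embeddings

    def __init__(self, dim: int):
        super().__init__()
        self.meta_ops = nn.Embedding(self.NUM_META_OPS, dim)
        # "lightweight temporal fusion": here a single depth-preserving
        # 1-D convolution over the time axis (an assumption)
        self.fuse = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor, op_ids: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame features; op_ids: (B, T) per-frame intent
        cues = frames + self.meta_ops(op_ids)  # inject per-frame intent
        fused = self.fuse(cues.transpose(1, 2)).transpose(1, 2)
        return frames + fused  # residual: negligible overhead over the backbone


B, T, D = 2, 16, 64
cond = MetaOpConditioner(D)
out = cond(torch.randn(B, T, D), torch.randint(0, 3, (B, T)))
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the cues enter as a residual on existing frame features, the pretrained backbone's weights and runtime path are left essentially unchanged, which matches the "negligible runtime overhead" claim.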
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Inpainting (Backcasting) | HumanML3D | MPJPE | 2.06 | 10 |
| Temporal Inpainting (Prediction) | HumanML3D | MPJPE | 8.55 | 10 |
| Instruction-Based Motion Editing | MotionFix (Batch) | R@1 | 98.08 | 10 |
| Instruction-Based Motion Editing | MotionFix (Full) | R@1 | 61.7 | 9 |
| Geometric-Constrained Motion Generation | Geometric-Constrained Generation | Trajectory Error | 18.78 | 8 |
| Text-to-motion | HumanML3D (test; official evaluator from MotionStreamer) | FID | 9.46 | 7 |
| Reaction Generation | InterHuman | FID | 2.055 | 4 |