MotionChain: Conversational Motion Controllers via Multimodal Prompts
About
Recent advancements in language models have demonstrated their adeptness at conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly human motion models. By integrating multi-turn conversation into the control of continuous virtual human movements, generative human motion models can enable an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, and other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous, long-term human motion from multimodal prompts. Specifically, MotionChain consists of multimodal tokenizers that transform various data types, such as text, image, and motion, into discrete tokens, coupled with a vision-motion-aware language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation as well as a more intuitive way of controlling and interacting with virtual humans.
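The abstract describes a tokenize-then-predict design: continuous motion is quantized into discrete tokens, and a language model autoregressively predicts motion tokens conditioned on the full conversation history, so later turns stay consistent with earlier ones. The sketch below illustrates that idea in PyTorch. It is a minimal illustration under assumed names and shapes (`MotionVQTokenizer`, the `lm.generate` call, the 263-dimensional HumanML3D motion features), not the actual MotionChain implementation.

```python
# Hypothetical sketch of a conversational motion pipeline: a VQ-style motion
# tokenizer plus a multi-turn generation loop. Names and signatures here are
# illustrative assumptions, not the MotionChain API.
import torch


class MotionVQTokenizer(torch.nn.Module):
    """Toy VQ-VAE-style motion tokenizer: maps a motion sequence of shape
    (T, dim) to a sequence of discrete codebook indices and back."""

    def __init__(self, dim=263, codebook_size=512, latent_dim=256):
        super().__init__()
        self.encoder = torch.nn.Linear(dim, latent_dim)
        self.codebook = torch.nn.Embedding(codebook_size, latent_dim)
        self.decoder = torch.nn.Linear(latent_dim, dim)

    def encode(self, motion):
        """motion: (T, dim) float tensor -> (T,) long tensor of token ids."""
        z = self.encoder(motion)                      # (T, latent_dim)
        dist = torch.cdist(z, self.codebook.weight)   # (T, codebook_size)
        return dist.argmin(dim=-1)                    # nearest codebook entry

    def decode(self, token_ids):
        """(T,) long tensor of token ids -> (T, dim) reconstructed motion."""
        return self.decoder(self.codebook(token_ids))


def converse(lm, text_tokenizer, motion_tokenizer, turns):
    """Multi-turn motion generation: each prompt is appended to the running
    token history, so every generated motion is conditioned on the full
    conversation so far. `lm.generate` is a hypothetical autoregressive
    sampler that emits motion-token ids."""
    history, motions = [], []
    for prompt in turns:
        history += text_tokenizer.encode(prompt)
        new_motion_tokens = lm.generate(history, modality="motion")  # assumed
        history += new_motion_tokens
        motions.append(motion_tokenizer.decode(torch.tensor(new_motion_tokens)))
    return motions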
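```

One appeal of this discrete-token framing is that text, image, and motion all share a single token vocabulary, so the same language-model backbone can attend over the whole multimodal conversation history without modality-specific fusion modules.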
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Motion Generation | HumanML3D (test) | FID | 0.248 | 331 |
| Text-to-Motion Mapping | HumanML3D (test) | FID | 0.248 | 243 |
| Text-to-Motion Synthesis | HumanML3D | R-Precision (Top 1) | 50.4 | 43 |
| Text-driven Motion Generation | HumanML3D (test) | R-Precision@1 | 50.4 | 36 |
| Motion-to-Text | HumanML3D (test) | BLEU@4 | 12.56 | 32 |
| Motion Description | HumanML3D (test) | BLEU-1 | 48.1 | 27 |