MotionBooth: Motion-Aware Customized Text-to-Video Generation

About

In this work, we present MotionBooth, a framework for animating customized subjects with precise control over both object and camera movements. Using a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to improve subject learning, along with a subject token cross-attention loss that integrates the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motion during inference: cross-attention map manipulation governs subject motion, and a novel latent shift module controls camera movement. MotionBooth preserves the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth
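To make the idea of a latent shift more concrete, here is a minimal sketch of what shifting a latent grid could look like. It is not the authors' implementation: the function name `shift_latent`, the `dx`/`dy` parameters, and the rule for refilling vacated cells (sampling values from the original grid) are all assumptions made for illustration, and a real model would operate on multi-channel tensors for every frame rather than on a single 2D list.

```python
import random


def shift_latent(latent, dx, dy, rng=None):
    """Shift a 2D latent grid by (dx, dy) cells and refill the vacated cells.

    latent: list of rows, each a list of floats (one channel of one frame).
    dx > 0 moves content to the right, dy > 0 moves content down.
    Vacated cells are refilled with values sampled from the original grid;
    this fill rule is an assumption, not taken from the paper.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    height, width = len(latent), len(latent[0])
    pool = [value for row in latent for value in row]

    shifted = []
    for y in range(height):
        row = []
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                # The cell still has a source inside the grid: copy it over.
                row.append(latent[src_y][src_x])
            else:
                # The cell was uncovered by the shift: fill it with a sample.
                row.append(rng.choice(pool))
        shifted.append(row)
    return shifted


frame = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
moved = shift_latent(frame, dx=1, dy=0)
print(moved)  # columns 1 and 2 hold the original columns 0 and 1
```

Applying a per-frame shift that grows over time would move the scene content steadily across the frame, which is one plausible way a camera pan could be imitated without any additional training.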

Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen • 2024

Related benchmarks

Task	Dataset	Result
Motion-aware customized video generation	Customized Video Generation Evaluation Set	R-CLIP 71.2	8
Camera Movement Control	MSRVTT 1000 random videos (test)	FVD 723.3	6
Text-to-Video Generation	HumanVid 500 real-world videos (curated evaluation set)	LPIPS 0.69	5
Subject Motion Control	Subject Motion Control Evaluation Set (test)	R-CLIP Score 0.767	4
Customized Video Generation	Customized Video Generation Dataset (test)	CLIP-T 0.301	4
Video Generation	HECTOR Single-Object (Evaluation Set)	R-CLIP 62.77	4

