Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

About

The motion transfer task aims to transfer motion from a source video to newly generated videos, which requires the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on the separate spatial and temporal attention mechanisms of the 3D U-Net. In contrast, state-of-the-art video Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. This interaction between the spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve their motion transfer ability. Our approach introduces a simple yet effective temporal kernel that smooths DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. The temporal kernel also captures temporal variations in DiT features, which are closely related to motion. In addition, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. We also present MTBench, a general and challenging benchmark for motion transfer, together with a hybrid motion fidelity metric that considers both global and local motion similarity. Together, these provide a more comprehensive evaluation than previous work. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
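
To make the idea of smoothing features along the temporal dimension concrete, here is a minimal sketch in plain Python. It is not the DeT implementation: the abstract does not specify the kernel's form, so the fixed averaging kernel, the edge-replication padding, and the name `temporal_smooth` are all assumptions chosen for illustration.

```python
def temporal_smooth(features, kernel):
    """Smooth one token's features along the temporal (frame) axis.

    features: list of T frames, each a list of D floats.
    kernel:   odd-length list of weights (hypothetical fixed kernel;
              the kernel used in the paper may differ, e.g. be learned).
    Edge frames are handled by replicating the first/last frame.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    total = sum(kernel)
    weights = [w / total for w in kernel]  # normalize so weights sum to 1
    half = len(kernel) // 2
    num_frames = len(features)
    dim = len(features[0])

    smoothed = []
    for t in range(num_frames):
        out = [0.0] * dim
        for k, w in enumerate(weights):
            # Clamp the source frame index to stay inside the clip.
            src = min(max(t + k - half, 0), num_frames - 1)
            for d in range(dim):
                out[d] += w * features[src][d]
        smoothed.append(out)
    return smoothed


# A feature that spikes in one frame is spread over its temporal neighbors.
spiky = [[0.0], [3.0], [0.0]]
smoothed_spiky = temporal_smooth(spiky, [1, 1, 1])
```

Under these assumptions, content that stays constant over time passes through unchanged, while frame-to-frame variation is attenuated, which is the intuition behind using temporal smoothing to help separate static appearance from motion.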

Qingyu Shi, Jianzong Wu, Jinbin Bai, Jiangning Zhang, Lu Qi, Yunhai Tong, Xiangtai Li • 2025

Related benchmarks

Task	Dataset	Result
Video Generation	VBench	--	126
Video Motion Transfer	Video Motion Transfer Dataset 50 videos 1.0 (test)	Text Similarity: 34	9
Motion Transfer	DAVIS Medium	CLIP Score: 0.3225	9
Motion Transfer	DAVIS Hard	CLIP Score: 0.3257	9
Motion Transfer	DAVIS (All subsets)	CLIP Score: 0.3201	9
Motion Transfer	DAVIS Easy	CLIP Score: 0.3149	9
Video Motion Transfer	DAVIS	Text Similarity: 21.87	8
Motion Transfer	DAVIS curated subset of 50 videos	CS Score: 32.01	7
Video Motion Transfer	User Study	Motion Fidelity Score: 3.874	5

