Target-Aware Video Diffusion Models

About

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human-object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.
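The cross-attention alignment described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything here is an assumption made for illustration: the function names `cross_attention_mask_loss` and `selective_loss`, the use of mean squared error, the max-normalization of the attention map, and the representation of per-block attention as a dictionary of 2D arrays. It is not the authors' implementation.

```python
import numpy as np


def cross_attention_mask_loss(attn, mask, eps=1e-8):
    """Hypothetical loss: MSE between a max-normalized attention map and a binary mask.

    attn: (H, W) non-negative cross-attention weights of the special target token
          over spatial positions (assumed already averaged over heads).
    mask: (H, W) binary target mask, assumed resized to the attention resolution.
    """
    attn = np.asarray(attn, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if attn.shape != mask.shape:
        raise ValueError("attention map and mask must share a shape")
    # Scale the map to [0, 1] so it is comparable with the binary mask.
    attn_norm = attn / (attn.max() + eps)
    return float(np.mean((attn_norm - mask) ** 2))


def selective_loss(attn_per_block, mask, selected_blocks):
    """Average the loss over a chosen subset of transformer blocks only.

    attn_per_block: dict mapping a block index to its (H, W) attention map.
    selected_blocks: block indices to supervise (an assumed selection rule).
    """
    losses = [cross_attention_mask_loss(attn_per_block[b], mask)
              for b in selected_blocks]
    return float(np.mean(losses))


if __name__ == "__main__":
    mask = np.zeros((4, 4))
    mask[1:3, 1:3] = 1.0                 # target occupies the center
    aligned = mask * 0.9 + 0.01          # attention concentrated on the target
    diffuse = np.full((4, 4), 0.5)       # attention spread uniformly
    print(cross_attention_mask_loss(aligned, mask))
    print(cross_attention_mask_loss(diffuse, mask))
```

In this toy setup, the attention map concentrated on the masked region yields a lower loss than the uniform map, which is the behavior a fine-tuning objective of this kind would reward.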

Taeksoo Kim, Hanbyul Joo • 2025

Related benchmarks

Task	Dataset	Result	
Video Generation	VBench	--	126
Image-to-Video Generation	VBench	Motion Smoothness: 0.991	46
Image-to-Video Generation	InterGenEval (Synthetic (60 pairs) and Real (58 pairs))	KISA: 0.465	5
Controllable Text-to-Video Generation	Targeting Evaluation Dataset	Contact Score: 0.878	4
Target-Oriented Video Generation	80 Interaction Scenes	Contact Score: 0.878	4
