Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
About
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as control over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that fine-tuning on such simple data not only enables the desired controls but actually yields superior results to fine-tuning on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Controllable Video Generation | VBench (test) | X-CLIP Score | 25.595 | 7 |