
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

About

Recent video diffusion models achieve high-quality generation through recurrent frame processing, where each generated frame depends on previous frames. However, this recurrence means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding that limits gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from linear in the number of video frames (full backpropagation) to constant, and compares favorably to state-of-the-art video diffusion models on a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
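The memory argument can be made concrete on a toy frame recurrence. The sketch below is an illustrative assumption, not the paper's decoder: a scalar recurrence h_t = a*h_{t-1} + x_t stands in for the frame-recurrent decoder, and a squared error per frame stands in for the pixel-wise loss. Backpropagation is truncated ("chopped") to the last `window` frames, so only `window` activations are ever retained, regardless of sequence length.

```python
def decode_and_grad(a, xs, ys, window):
    """Toy frame-recurrent decoder h_t = a*h_{t-1} + x_t with a
    per-frame pixel loss L_t = (h_t - y_t)**2.

    Each L_t is backpropagated only through the last `window` frames
    (truncated BPTT), so the activation buffer holds at most `window`
    entries -> constant memory in the sequence length.
    Returns (gradient of total loss w.r.t. a, peak buffer size).
    """
    h = 0.0
    buf = []       # past hidden states kept for backprop
    grad_a = 0.0
    peak = 0
    for x, y in zip(xs, ys):
        buf.append(h)          # store h_{t-1} for this step
        if len(buf) > window:  # "chop": discard old activations
            buf.pop(0)
        h = a * h + x
        # Truncated chain rule: dh_t/da = sum_k a**k * h_{t-1-k},
        # summed only over the buffered (most recent) steps.
        dh_da = 0.0
        for k, h_prev in enumerate(reversed(buf)):
            dh_da += (a ** k) * h_prev
        grad_a += 2.0 * (h - y) * dh_da
        peak = max(peak, len(buf))
    return grad_a, peak
```

With `window` at least the sequence length this recovers the full-backpropagation gradient; shrinking the window trades a biased gradient for a fixed-size activation buffer, which is the trade-off the paper's analysis addresses.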

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Super-Resolution | UDM10 | PSNR | 26.7 | 48 |
| Video Super-Resolution | SPMCS | PSNR | 23.67 | 35 |
| Video Super-Resolution | MVSR4x | PSNR | 22.55 | 22 |
| Video Super-Resolution | YouHQ40 | PSNR | 24.58 | 18 |
| Video Super-Resolution | RealVSR | PSNR | 22.43 | 18 |
| Neural Novel View Synthesis | DL3DV-Benchmark (test) | FID | 11.209 | 8 |
| Controlled Driving Video Generation | Waymo Open Dataset | PSNR | 29.49 | 2 |
| Video Inpainting | DL3DV | FID | 40.948 | 2 |
| Video Inpainting | Waymo | FID | 27.057 | 2 |
| Video Inpainting | ROVI | FID | 27.547 | 2 |
Showing 10 of 11 rows
