
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

About

Recent video diffusion models achieve high-quality generation through recurrent frame processing, where each generated frame depends on previous frames. However, this recurrence means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding that limits gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from linear in the number of video frames (full backpropagation) to constant, and compares favorably to state-of-the-art video diffusion models on a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
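The memory argument can be made concrete on a toy frame recurrence. The sketch below is an illustrative assumption, not the paper's decoder: a scalar recurrence h_t = a*h_{t-1} + x_t stands in for the frame-recurrent decoder, and a squared error per frame stands in for the pixel-wise loss. Backpropagation is truncated ("chopped") to the last `window` frames, so only `window` activations are ever retained, regardless of sequence length.

```python
def decode_and_grad(a, xs, ys, window):
    """Toy frame-recurrent decoder h_t = a*h_{t-1} + x_t with a
    per-frame pixel loss L_t = (h_t - y_t)**2.

    Each L_t is backpropagated only through the last `window` frames
    (truncated BPTT), so the activation buffer holds at most `window`
    entries -> constant memory in the sequence length.
    Returns (gradient of total loss w.r.t. a, peak buffer size).
    """
    h = 0.0
    buf = []       # past hidden states kept for backprop
    grad_a = 0.0
    peak = 0
    for x, y in zip(xs, ys):
        buf.append(h)          # store h_{t-1} for this step
        if len(buf) > window:  # "chop": discard old activations
            buf.pop(0)
        h = a * h + x
        # Truncated chain rule: dh_t/da = sum_k a**k * h_{t-1-k},
        # summed only over the buffered (most recent) steps.
        dh_da = 0.0
        for k, h_prev in enumerate(reversed(buf)):
            dh_da += (a ** k) * h_prev
        grad_a += 2.0 * (h - y) * dh_da
        peak = max(peak, len(buf))
    return grad_a, peak
```

With `window` at least the sequence length this recovers the full-backpropagation gradient; shrinking the window trades a biased gradient for a fixed-size activation buffer, which is the trade-off the paper's analysis addresses.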

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Super-Resolution | UDM10 | PSNR | 26.7 | 48 |
| Video Super-Resolution | SPMCS | PSNR | 23.67 | 35 |
| Video Super-Resolution | MVSR4x | PSNR | 22.55 | 22 |
| Video Super-Resolution | YouHQ40 | PSNR | 24.58 | 18 |
| Video Super-Resolution | RealVSR | PSNR | 22.43 | 18 |
| Neural Novel View Synthesis | DL3DV-Benchmark (test) | FID | 11.209 | 8 |
| Controlled Driving Video Generation | Waymo Open Dataset | PSNR | 29.49 | 2 |
| Video Inpainting | DL3DV | FID | 40.948 | 2 |
| Video Inpainting | Waymo | FID | 27.057 | 2 |
| Video Inpainting | ROVI | FID | 27.547 | 2 |
Showing 10 of 11 rows
