DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training

About

Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
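As a rough illustration of the residual formulation and the DLT augmentation described above, the following NumPy sketch computes an inter-frame residual, accumulates events between an anchor and a target timestamp into an event frame, and samples anchor/target pairs separated by a varying number of frames. The function names, the `(t, x, y, polarity)` event layout, and the `max_gap` parameter are assumptions made for illustration; the abstract does not specify them. The sketch also works on raw pixels, whereas the method described above operates on latents produced by the ER-VAE and a pre-trained Stable Diffusion model.

```python
import numpy as np


def inter_frame_residual(anchor, target):
    """Residual to be synthesized: target frame minus anchor frame."""
    return target.astype(np.float32) - anchor.astype(np.float32)


def reconstruct_target(anchor, residual):
    """Add a (predicted) residual back onto the anchor, clipped to [0, 255]."""
    return np.clip(anchor.astype(np.float32) + residual, 0.0, 255.0)


def accumulate_events(events, height, width, t_start, t_end):
    """Sum event polarities per pixel over the interval [t_start, t_end).

    `events` is assumed to be an (N, 4) array of (t, x, y, polarity) rows
    with polarity in {-1, +1}. This layout is hypothetical.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    mask = (events[:, 0] >= t_start) & (events[:, 0] < t_end)
    selected = events[mask]
    xs = selected[:, 1].astype(int)
    ys = selected[:, 2].astype(int)
    # np.add.at accumulates correctly when a pixel appears more than once.
    np.add.at(frame, (ys, xs), selected[:, 3])
    return frame


def sample_dlt_pair(num_frames, max_gap, rng):
    """Sample an anchor index and a target index 1..max_gap frames later.

    Assumes max_gap < num_frames so that a valid anchor always exists.
    """
    gap = int(rng.integers(1, max_gap + 1))
    anchor_idx = int(rng.integers(0, num_frames - gap))
    return anchor_idx, anchor_idx + gap
```

Sampling a different gap for every training example exposes the model to both short and long anchor-to-target intervals, which is the intuition behind training on frame segments of varying temporal lengths.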

Jiyun Kong, Jun-Hyuk Kim, Jong-Seok Lee • 2025

Related benchmarks

Task                                     Dataset                  Result            Rank
Video Frame Prediction                   GoPro 15 frames          PSNR 19.73        10
Video Frame Prediction                   BS-ERGB 1 frame (test)   PSNR 25.27        10
Video Frame Prediction                   BS-ERGB 3 frames (test)  PSNR 24.21        10
Video Frame Prediction                   HS-ERGB 7 frames (test)  PSNR 29.74        10
Video Frame Prediction                   GoPro 7 frames           PSNR 18.85        10
Event-based Video Frame Prediction       BS-ERGB                  GFLOPs 1.03e+5    2
Event-based Video Frame Interpolation    BS-ERGB                  --                2
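For readers unfamiliar with the PSNR metric reported in the table, the sketch below shows the standard definition in decibels. It assumes 8-bit frames with a peak value of 255 and is not taken from the paper's evaluation code; the function name `psnr` and its `peak` parameter are illustrative.

```python
import numpy as np


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Higher values indicate a prediction closer to the ground-truth frame.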
