
Learning Joint Spatial-Temporal Transformations for Video Inpainting

About

High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching for missing content in reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along the spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN.
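To make the joint self-attention idea concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: names such as SpatialTemporalAttention and patch_size are illustrative, and it omits details like masking of invalid key patches and multi-head, multi-scale attention. The core idea it shows is that patches drawn from all frames attend to patches from all frames, so a missing region can borrow coherent content from any spatial location at any time step.

```python
# Minimal sketch (not the authors' code) of joint spatial-temporal
# patch attention: every patch of every frame attends to every patch
# of every frame. H and W must be divisible by patch_size.
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Illustrative joint attention over patches from all input frames."""
    def __init__(self, channels: int, patch_size: int = 8):
        super().__init__()
        self.patch_size = patch_size
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) -- feature maps of T frames.
        b, t, c, h, w = feats.shape
        p = self.patch_size
        x = feats.view(b * t, c, h, w)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        def to_patches(z):
            # (B*T, C, H, W) -> (B, T * num_patches, C * p * p)
            z = z.view(b, t, c, h // p, p, w // p, p)
            return z.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, -1, c * p * p)

        q, k, v = to_patches(q), to_patches(k), to_patches(v)
        # Attention across ALL patches of ALL frames jointly.
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ v  # (B, T * num_patches, C * p * p)
        # Fold patches back into per-frame feature maps.
        out = out.view(b, t, h // p, w // p, c, p, p)
        return out.permute(0, 1, 4, 2, 5, 3, 6).reshape(b, t, c, h, w)

# Usage: 5 frames of 32-channel features, jointly attended in one pass.
attn = SpatialTemporalAttention(channels=32)
frames = torch.randn(1, 5, 32, 64, 64)
out = attn(frames)  # same shape as the input
```

In the full model, such attention layers are stacked and trained end-to-end together with reconstruction terms and the spatial-temporal adversarial loss mentioned in the abstract.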

Yanhong Zeng, Jianlong Fu, Hongyang Chao • 2020

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
|------|---------|--------|-------|------|
| Video Inpainting | DAVIS (test) | PSNR (dB) | 30.67 | 54 |
| Semantic Segmentation | KITTI-360 (test) | mIoU (%) | 66.97 | 25 |
| Video Inpainting | DAVIS | PSNR (dB) | 28.891 | 22 |
| Video Inpainting | YouTube-VOS | PSNR (dB) | 28.993 | 15 |
| Video Inpainting | YouTube-VOS 720P 2018 (test) | PSNR (dB) | 29.9172 | 14 |
| Video Inpainting | DAVIS 480P 2017 (test) | PSNR (dB) | 26.5453 | 14 |
| Video Inpainting | DAVIS object mask (test) | PSNR (dB) | 32.83 | 14 |
| Video Inpainting | YouTube-VOS square mask (test) | PSNR (dB) | 32.49 | 14 |
| Video Inpainting | DAVIS square mask (test) | PSNR (dB) | 30.54 | 14 |
| Video Inpainting | HQVI | PSNR (dB) | 29.64 | 13 |

Showing 10 of 26 rows.
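All PSNR scores above are in dB (higher is better). As a reference for what the metric measures, here is a minimal sketch of the standard per-frame PSNR computation for 8-bit frames; the exact masking and averaging conventions of each benchmark's evaluation script may differ.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped frames."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

Video inpainting benchmarks typically report this value averaged over all frames of all test videos.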

Other info

Code: https://github.com/researchmm/STTN
