
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

About

Audio-visual saliency prediction can benefit from complementary cues across modalities, but further performance gains are still limited by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise for unifying task frameworks owing to their inherent ability to generalize. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task over the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an extra network, Saliency-UNet, is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from a noisy map. Extensive experiments demonstrate that DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics.
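The conditional-generation formulation above can be sketched with a minimal forward diffusion process: the clean saliency map is progressively noised, and a conditional denoiser (Saliency-UNet in the paper) would learn to reverse this given the audio-visual features. The schedule and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

T = 100                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Forward process: produce a noisy map x_t from the clean saliency map x0."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.random((16, 16))                 # stand-in for a ground-truth saliency map
eps = rng.standard_normal(x0.shape)       # Gaussian noise

x_early = q_sample(x0, t=0, eps=eps)      # almost clean
x_late = q_sample(x0, t=T - 1, eps=eps)   # heavily noised

# At small t the noisy map stays close to x0; at large t noise dominates.
print(np.abs(x_early - x0).mean() < np.abs(x_late - x0).mean())  # True
```

During training, the denoiser would take `x_t`, the step `t`, and the audio-visual conditioning features, and be supervised to recover `x0` (or the noise `eps`); at inference, it would iterate this refinement from pure noise down to the predicted saliency map.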

Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha • 2024

Related benchmarks

Task                                    Dataset                                              Metric  Result  Rank
No-Reference Video Quality Assessment   LIVE-VQC                                             SRCC    0.874   50
No-Reference Video Quality Assessment   YouTube-UGC                                          SRCC    0.884   47
No-Reference Video Quality Assessment   KoNViD-1k                                            SRCC    0.867   42
Video Quality Assessment                LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average)  SROCC   0.876   23
No-Reference Video Quality Assessment   LSVQ                                                 PLCC    0.847   13
Saliency Prediction                     DHF1K (val)                                          NSS     3.066   7
Saliency Prediction                     DIEM (val)                                           NSS     2.65    7
