DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
About
Audio-visual saliency prediction can draw support from complementary information across modalities, but further performance gains are still limited by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise for unifying task frameworks owing to their inherent generalization ability. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task over the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an extra network, Saliency-UNet, is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from a noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over previous state-of-the-art results across six metrics.
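To make the conditional-generation formulation concrete, below is a minimal PyTorch sketch of the kind of training step it describes: the ground-truth saliency map is noised by a standard forward diffusion process, and a denoiser conditioned on audio-visual features via cross-attention learns to recover the clean map. Everything here is an illustrative assumption, not the paper's method: `TinySaliencyDenoiser` and `CrossAttentionModulation` are hypothetical stand-ins for Saliency-UNet and its multi-modal attention modulation, and the x0-prediction objective and linear beta schedule are one common choice among several.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionModulation(nn.Module):
    """Inject audio-visual condition tokens into noisy-map tokens.

    Hypothetical stand-in for the paper's multi-modal attention modulation.
    """

    def __init__(self, dim: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) noisy-map tokens; cond: (B, M, cond_dim) AV tokens.
        out, _ = self.attn(self.norm(x), cond, cond)
        return x + out  # residual connection


class TinySaliencyDenoiser(nn.Module):
    """Toy denoiser: predicts the clean saliency map x0 from a noised map
    x_t, conditioned on spatio-temporal audio-visual features."""

    def __init__(self, dim: int = 64, cond_dim: int = 128):
        super().__init__()
        self.proj_in = nn.Conv2d(1, dim, kernel_size=3, padding=1)
        self.time_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.modulate = CrossAttentionModulation(dim, cond_dim)
        self.proj_out = nn.Conv2d(dim, 1, kernel_size=3, padding=1)

    def forward(self, x_t, t, cond):
        b, _, h, w = x_t.shape
        feat = self.proj_in(x_t)  # (B, dim, H, W)
        # Simple timestep embedding so the net knows the noise level.
        t_emb = self.time_mlp(t.float().view(b, 1) / 1000.0)
        feat = feat + t_emb.view(b, -1, 1, 1)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, dim)
        tokens = self.modulate(tokens, cond)
        feat = tokens.transpose(1, 2).reshape(b, -1, h, w)
        return self.proj_out(feat)  # predicted clean map x0


def make_schedule(T: int = 1000) -> torch.Tensor:
    """Linear beta schedule; returns cumulative alpha-bar per timestep."""
    betas = torch.linspace(1e-4, 0.02, T)
    return torch.cumprod(1.0 - betas, dim=0)


def training_step(model, saliency, cond, alpha_bar):
    """One diffusion training step with an x0-prediction MSE loss."""
    b = saliency.size(0)
    t = torch.randint(0, alpha_bar.numel(), (b,))
    ab = alpha_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(saliency)
    x_t = ab.sqrt() * saliency + (1 - ab).sqrt() * noise  # forward noising
    x0_pred = model(x_t, t, cond)
    return F.mse_loss(x0_pred, saliency)


if __name__ == "__main__":
    model = TinySaliencyDenoiser()
    alpha_bar = make_schedule()
    saliency = torch.rand(2, 1, 32, 32)  # ground-truth saliency maps
    cond = torch.randn(2, 16, 128)       # pooled audio-visual tokens
    loss = training_step(model, saliency, cond, alpha_bar)
    loss.backward()
    print(f"loss: {loss.item():.4f}")
```

At inference, such a denoiser would be applied iteratively, starting from pure noise and progressively refining the predicted map with a standard sampler such as DDIM; that loop is omitted here for brevity.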
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| No-Reference Video Quality Assessment | LIVE-VQC | SRCC 0.874 | 50 |
| No-Reference Video Quality Assessment | YouTube-UGC | SRCC 0.884 | 47 |
| No-Reference Video Quality Assessment | KoNViD-1k | SRCC 0.867 | 42 |
| Video Quality Assessment | LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average) | SROCC 0.876 | 23 |
| No-Reference Video Quality Assessment | LSVQ | PLCC 0.847 | 13 |
| Saliency Prediction | DHF1K (val) | NSS 3.066 | 7 |
| Saliency Prediction | DIEM (val) | NSS 2.65 | 7 |