
Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues

About

This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design. Its lightweight decoder significantly reduces model size and parameter count without compromising performance. A second model, ViNet-A (148MB), incorporates spatio-temporal action localization (STAL) features, unlike traditional video saliency models, which use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, formed by averaging their predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets. It outperforms transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000 fps.
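The ensembling step described in the abstract, averaging the saliency maps predicted by the two models, can be illustrated with a minimal sketch. This is not the authors' code: the function name `ensemble_average`, the nested-list map representation, and the toy values are assumptions made for illustration, and any resizing or normalization the paper may apply before averaging is omitted.

```python
from typing import List, Sequence

# Hypothetical representation: a saliency map as rows of per-pixel values.
SaliencyMap = List[List[float]]


def ensemble_average(maps: Sequence[SaliencyMap]) -> SaliencyMap:
    """Return the pixel-wise mean of several same-sized saliency maps."""
    if not maps:
        raise ValueError("need at least one saliency map")
    height = len(maps[0])
    width = len(maps[0][0]) if height else 0
    # All maps must share one shape, otherwise pixels do not line up.
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all saliency maps must share the same shape")
    n = len(maps)
    return [
        [sum(m[y][x] for m in maps) / n for x in range(width)]
        for y in range(height)
    ]


# Toy 2x2 predictions standing in for the outputs of the two models
# on a single frame (values are made up for the example).
pred_s = [[0.0, 0.5], [1.0, 0.25]]
pred_a = [[0.5, 0.5], [0.0, 0.75]]

fused = ensemble_average([pred_s, pred_a])
print(fused)  # [[0.25, 0.5], [0.5, 0.5]]
```

In practice the maps would be arrays produced by the two networks for each frame; the sketch only shows that the fused prediction is a plain per-pixel mean.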

Rohit Girmaji, Siddharth Jain, Bhav Beri, Sarthak Bansal, Vineet Gandhi • 2025

Related benchmarks

Task                                   Dataset                                              Result       Rank
No-Reference Video Quality Assessment  LIVE-VQC                                             SRCC 0.883   50
No-Reference Video Quality Assessment  YouTube-UGC                                          SRCC 0.887   47
No-Reference Video Quality Assessment  KoNViD-1k                                            SRCC 0.872   42
Video Quality Assessment               LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average)  SROCC 0.882  23
No-Reference Video Quality Assessment  LSVQ                                                 PLCC 0.85    13
Saliency Prediction                    DIEM (val)                                           NSS 2.732    7
Saliency Prediction                    DHF1K (val)                                          NSS 3.008    7
Driver Attention Prediction            DriverGaze360 (test)                                 KLD 1.251    6
Driver Attention Prediction            DADA-2000                                            KLD 1.719    6
