Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues
About
This paper introduces ViNet-S, a 36 MB model that builds on the ViNet architecture and follows a U-Net design. Its lightweight decoder significantly reduces model size and parameter count without compromising performance. The paper also introduces ViNet-A (148 MB), which incorporates spatio-temporal action localization (STAL) features, unlike traditional video saliency models that rely on action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, formed by averaging their predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets while outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000 fps.
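The ensembling step described above, averaging the saliency maps predicted by the two models, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ensemble_average` and the use of NumPy arrays are assumptions, and any normalization or post-processing the paper may apply before or after averaging is not shown.

```python
import numpy as np


def ensemble_average(map_s, map_a):
    """Pixel-wise mean of two saliency maps of identical shape.

    `map_s` and `map_a` stand in for the outputs of ViNet-S and ViNet-A
    (hypothetical names; the real models are not loaded here).
    """
    map_s = np.asarray(map_s, dtype=np.float64)
    map_a = np.asarray(map_a, dtype=np.float64)
    if map_s.shape != map_a.shape:
        raise ValueError("saliency maps must have the same shape")
    # Equal-weight average of the two predictions.
    return 0.5 * (map_s + map_a)
```

Calling `ensemble_average(pred_s, pred_a)` on two maps of shape `(H, W)` returns a map of the same shape. If the two models emit different resolutions, the maps would need to be resized to a common size first.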
Rohit Girmaji, Siddharth Jain, Bhav Beri, Sarthak Bansal, Vineet Gandhi • 2025
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| No-Reference Video Quality Assessment | LIVE-VQC | SRCC | 0.883 | 50 |
| No-Reference Video Quality Assessment | YouTube-UGC | SRCC | 0.887 | 47 |
| No-Reference Video Quality Assessment | KoNViD-1k | SRCC | 0.872 | 42 |
| Video Quality Assessment | LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average) | SROCC | 0.882 | 23 |
| No-Reference Video Quality Assessment | LSVQ | PLCC | 0.85 | 13 |
| Saliency Prediction | DIEM (val) | NSS | 2.732 | 7 |
| Saliency Prediction | DHF1K (val) | NSS | 3.008 | 7 |
| Driver Attention Prediction | DriverGaze360 (test) | KLD | 1.251 | 6 |
| Driver Attention Prediction | DADA-2000 | KLD | 1.719 | 6 |
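For readers unfamiliar with the saliency metrics in the table, the sketch below shows commonly used definitions of NSS (Normalized Scanpath Saliency, higher is better) and KLD (Kullback-Leibler divergence, lower is better). It assumes NumPy and hypothetical function names, and the exact evaluation code behind the listed benchmark numbers may differ in details such as smoothing constants or resizing.

```python
import numpy as np


def nss(pred, fixations):
    """Mean of the z-scored predicted map at fixated pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    fix = np.asarray(fixations, dtype=bool)
    if not fix.any():
        raise ValueError("fixation map has no fixated pixels")
    std = pred.std()
    if std == 0:
        raise ValueError("predicted map is constant")
    z = (pred - pred.mean()) / std
    return float(z[fix].mean())


def kld(pred, gt, eps=1e-12):
    """KL divergence of the predicted map from the ground-truth density."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Normalize both maps so that each sums to (approximately) one.
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.sum(q * np.log(eps + q / (p + eps))))
```

NSS takes a binary fixation map, while KLD takes a continuous ground-truth density map. A prediction identical to the ground truth gives a KLD of approximately zero.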