Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues

About

This paper introduces ViNet-S, a 36 MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148 MB) incorporates spatio-temporal action localization (STAL) features, in contrast to traditional video saliency models, which use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, obtained by averaging their predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000 fps.
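The ensembling step described above, a pixel-wise average of two predicted saliency maps, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ensemble_average` and the nested-list map representation are assumptions made here to keep the example dependency-free, and the abstract does not say whether any normalization is applied before or after averaging.

```python
def ensemble_average(map_a, map_b):
    """Return the pixel-wise mean of two equally sized 2D saliency maps.

    Both maps are lists of rows (hypothetical representation chosen for
    this sketch; a real pipeline would likely use tensors or arrays).
    """
    same_rows = len(map_a) == len(map_b)
    same_cols = same_rows and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(map_a, map_b)
    )
    if not same_cols:
        raise ValueError("saliency maps must have the same shape")
    # Average corresponding pixels of the two predictions.
    return [
        [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


# Toy 2x2 maps standing in for the ViNet-S and ViNet-A predictions.
pred_s = [[0.0, 0.5], [0.25, 1.0]]
pred_a = [[0.5, 0.5], [0.75, 1.0]]
fused = ensemble_average(pred_s, pred_a)
```

With the toy inputs above, `fused` holds the mean of each pixel pair; the same operation applies frame by frame to full-resolution predictions.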

Rohit Girmaji, Siddharth Jain, Bhav Beri, Sarthak Bansal, Vineet Gandhi • 2025

Related benchmarks

Task	Dataset	Result
No-Reference Video Quality Assessment	LIVE-VQC	SRCC 0.883	50
No-Reference Video Quality Assessment	YouTube-UGC	SRCC 0.887	47
No-Reference Video Quality Assessment	KoNViD-1k	SRCC 0.872	42
Video Quality Assessment	LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average)	SROCC 0.882	23
No-Reference Video Quality Assessment	LSVQ	PLCC 0.85	13
Driver Attention Prediction	DADA-2000	KLD 1.719	11
Saliency Prediction	DIEM (val)	NSS 2.732	7
Saliency Prediction	DHF1K (val)	NSS 3.008	7
Driver Attention Prediction	DriverGaze360 (test)	KLD 1.251	6
Saliency Prediction	Video Ads (test)	KL Divergence 1.252	5
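
For readers unfamiliar with the NSS figures in the table, Normalized Scanpath Saliency is commonly computed as the mean of the z-scored saliency map at human fixation locations. The sketch below follows that common definition and is not taken from the paper's evaluation code: the function name `nss`, the nested-list map, the `(row, col)` fixation format, and the use of the population standard deviation are all assumptions of this example.

```python
import math


def nss(saliency, fixations):
    """Mean z-scored saliency value at the fixated pixels.

    saliency:  2D map as a list of rows (hypothetical representation).
    fixations: list of (row, col) pixel coordinates of human fixations.
    """
    flat = [value for row in saliency for value in row]
    mean = sum(flat) / len(flat)
    # Population standard deviation; some implementations use the sample form.
    std = math.sqrt(sum((value - mean) ** 2 for value in flat) / len(flat))
    if std == 0 or not fixations:
        raise ValueError("need a non-constant map and at least one fixation")
    scores = [(saliency[r][c] - mean) / std for r, c in fixations]
    return sum(scores) / len(scores)


# Toy example: one salient pixel that coincides with the single fixation.
score = nss([[0.0, 0.0], [0.0, 4.0]], [(1, 1)])
```

Higher NSS means the predicted map assigns above-average saliency to the fixated locations; a value near zero corresponds to chance-level agreement.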

