
TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection

About

TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. It consists of two building blocks: first, the encoder network extracts low-resolution spatiotemporal features from an input clip of several consecutive frames, and then the following prediction network decodes the encoded features spatially while aggregating all the temporal information. As a result, a single prediction map is produced from an input clip of multiple frames. Frame-wise saliency maps can be predicted by applying TASED-Net in a sliding-window fashion to a video. The proposed approach assumes that the saliency map of any frame can be predicted by considering a limited number of past frames. The results of our extensive experiments on video saliency detection validate this assumption and demonstrate that our fully-convolutional model with temporal aggregation method is effective. TASED-Net significantly outperforms previous state-of-the-art approaches on all three major large-scale datasets of video saliency detection: DHF1K, Hollywood2, and UCFSports. After analyzing the results qualitatively, we observe that our model is especially better at attending to salient moving objects.
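The sliding-window inference described above can be sketched as follows. This is a hedged illustration only: `predict_saliency` is a stand-in for the real TASED-Net forward pass (it simply averages the clip's frames so the example runs), the 32-frame window follows the paper's clip length, and the handling of the first frames (shorter clips) is an assumption, not the authors' method.

```python
import numpy as np

def predict_saliency(clip):
    # Stand-in for TASED-Net (hypothetical): the real model is a 3D
    # fully-convolutional encoder-decoder that maps a clip of frames
    # to one saliency map. Here we average frames so the sliding-window
    # logic below is runnable end to end.
    return clip.mean(axis=0)

def sliding_window_saliency(frames, window=32):
    """Produce one saliency map per frame from the preceding `window`
    frames, applied in a sliding-window fashion over the video.
    Early frames use however many past frames are available
    (this padding choice is an assumption for the sketch)."""
    maps = []
    for t in range(len(frames)):
        start = max(0, t - window + 1)
        clip = np.stack(frames[start:t + 1])  # (T, H, W) clip ending at frame t
        maps.append(predict_saliency(clip))
    return maps

# Toy video: 100 random grayscale frames of size 64x64.
video = [np.random.rand(64, 64) for _ in range(100)]
saliency_maps = sliding_window_saliency(video, window=32)
```

Each output map is aligned with the last frame of its window, matching the assumption that a frame's saliency can be predicted from a limited number of past frames.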

Kyle Min, Jason J. Corso · 2019

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video saliency prediction | DHF1K (test) | AUC-J | 0.895 | 89
Video saliency prediction | Hollywood-2 (test) | SIM | 0.507 | 83
Video saliency prediction | UCF Sports (test) | SIM | 0.469 | 71
Video saliency prediction | DHF1K | AUC-J | 0.895 | 51
Saliency Prediction | DIEM (test) | SIM | 0.461 | 28
Driver Visual Attention Prediction | TrafficGaze (test) | KLD | 1.43 | 16
Driver Visual Attention Prediction | DADA 2000 (test) | KLD | 1.78 | 15
Saliency Prediction | PVS-HM | CC | 0.651 | 15
Driver Visual Attention Prediction | BDD-A (test) | KLD | 1.24 | 15
Saliency Prediction | Sport360 | CC | 0.352 | 15

(Showing 10 of 23 benchmark rows.)
