# TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection

## About
TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. It consists of two building blocks: the encoder network extracts low-resolution spatiotemporal features from an input clip of several consecutive frames, and the prediction network then decodes the encoded features spatially while aggregating all of the temporal information. As a result, a single prediction map is produced from an input clip of multiple frames, and frame-wise saliency maps can be obtained by applying TASED-Net to a video in a sliding-window fashion.

The proposed approach assumes that the saliency map of any frame can be predicted from a limited number of past frames. Our extensive experiments on video saliency detection validate this assumption and show that our fully-convolutional model with its temporal aggregation scheme is effective: TASED-Net significantly outperforms previous state-of-the-art approaches on all three major large-scale video saliency datasets, DHF1K, Hollywood2, and UCF Sports. Qualitative analysis further shows that our model is especially good at attending to salient moving objects.
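The sliding-window application described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function name, the start-of-video padding scheme (repeating the first frame), and the default clip length are assumptions made for the sketch.

```python
import numpy as np

def predict_video_saliency(frames, model, clip_len=32):
    """Apply a clip-to-map saliency model to a video in a sliding window.

    frames: array of shape (N, H, W, C) holding N video frames.
    model:  callable mapping a (clip_len, H, W, C) clip to a single
            (H, W) saliency map for the clip's last frame.

    The first clip_len - 1 frames lack a full window of past frames, so
    the video is padded at the start with copies of its first frame
    (one simple choice; the paper's exact handling may differ).
    """
    n = len(frames)
    pad = np.repeat(frames[:1], clip_len - 1, axis=0)
    padded = np.concatenate([pad, frames], axis=0)
    maps = []
    for t in range(n):
        clip = padded[t:t + clip_len]  # window ending at original frame t
        maps.append(model(clip))
    return np.stack(maps)  # (N, H, W): one saliency map per frame
```

With a trained TASED-Net plugged in as `model`, this yields one saliency map per input frame, each conditioned only on that frame and a fixed number of past frames.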
## Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video saliency prediction | DHF1K (test) | AUC-J | 0.895 | 89 |
| Video saliency prediction | Hollywood-2 (test) | SIM | 0.507 | 83 |
| Video saliency prediction | UCF Sports (test) | SIM | 0.469 | 71 |
| Saliency prediction | DIEM (test) | SIM | 0.461 | 28 |
| Saliency prediction | PVS-HM | CC | 0.651 | 15 |
| Saliency prediction | Sport360 | CC | 0.352 | 15 |
| Saliency prediction | DHF1K | Model size (MB) | 82 | 12 |
| Video saliency prediction | UCF Sports | NSS | 2.92 | 11 |
| Saliency prediction | VR-EyeTracking | CC | 0.201 | 9 |
| Video saliency prediction | Coutrot1 (test) | CC | 0.479 | 7 |