VST++: Efficient and Stronger Visual Saliency Transformer

About

Previous CNN-based models have shown promising results for salient object detection (SOD), but their ability to capture global long-range dependencies is limited. Our previous work, the Visual Saliency Transformer (VST), addressed this limitation from a transformer-based sequence-to-sequence perspective and unified RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outputs in a pure transformer architecture. We also introduced a token upsampling method, reverse T2T, for predicting high-resolution saliency maps within transformer-based structures. Building on VST, this work proposes a more efficient and stronger version, VST++. To reduce the computational cost of VST, we propose a Select-Integrate Attention (SIA) module, which partitions the foreground into fine-grained segments and aggregates background information into a single coarse-grained token. To incorporate 3D depth information at low cost, we design a depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss that provides direct guidance for the task-related tokens. We evaluate VST++ with various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The strong generalization ability, enhanced performance, and improved efficiency of VST++ highlight its potential.
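The select-and-integrate idea described in the abstract can be illustrated with a small sketch. This is not the authors' SIA module: it only assumes that a binary foreground mask over the tokens is already available, keeps each foreground token as its own key, and averages all background tokens into one coarse token before ordinary scaled dot-product attention. The function names (`select_integrate_tokens`, `attend`) and the toy data are hypothetical.

```python
import math

def select_integrate_tokens(tokens, fg_mask):
    """Keep foreground tokens individually; average background tokens into one token."""
    fg = [t for t, m in zip(tokens, fg_mask) if m]
    bg = [t for t, m in zip(tokens, fg_mask) if not m]
    reduced = list(fg)
    if bg:
        dim = len(bg[0])
        # One coarse-grained token summarising the whole background.
        reduced.append([sum(t[d] for t in bg) / len(bg) for d in range(dim)])
    return reduced

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values share the same vectors here."""
    dim = len(keys_values[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, keys_values))
                    for d in range(dim)])
    return out

# Toy example: 4 tokens of dimension 2, the first two marked as foreground (assumed mask).
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
fg_mask = [True, True, False, False]
reduced = select_integrate_tokens(tokens, fg_mask)  # 2 foreground tokens + 1 background token
attended = attend(tokens, reduced)                  # every query attends to only 3 keys
```

In this toy setting each query attends to three keys instead of four; the saving grows with the share of background tokens, which is the intuition behind the reduced cost, though the actual module in the paper may differ in how foreground is selected and partitioned.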

Nian Liu, Ziyang Luo, Ni Zhang, Junwei Han • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| RGB-D Salient Object Detection | STERE | S-measure (Sα) | 0.921 | 198 |
| Salient Object Detection | PASCAL-S | -- | -- | 186 |
| RGB-D Salient Object Detection | SIP | S-measure (Sα) | 0.904 | 124 |
| RGB-D Salient Object Detection | NLPR (test) | S-measure (Sα) | 93.3 | 71 |
| RGB-D Saliency Detection | NLPR | Max F-beta | 0.925 | 65 |
| RGB-D Salient Object Detection | NJUD | S-measure | 92.8 | 54 |
| Salient Object Detection | VT5000 | S-Measure | 0.895 | 50 |
| RGB-D Salient Object Detection | STERE (test) | S-measure (Sα) | 0.916 | 45 |
| RGB-D Salient Object Detection | SIP (test) | S-measure (Sα) | 90.3 | 37 |
| Salient Object Detection | VT821 | S-Measure | 0.894 | 36 |
Showing 10 of 19 rows
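
For readers unfamiliar with the Max F-beta metric listed above, the following is a minimal sketch of how it is commonly computed in the SOD literature: the predicted saliency map is binarised at a sweep of thresholds, and the best F-beta score is reported. It assumes the conventional weighting beta² = 0.3 and a flat list of per-pixel predictions in [0, 1]; the function name `max_f_beta` is hypothetical, and this is not the evaluation code behind the numbers in the table.

```python
def max_f_beta(pred, gt, beta_sq=0.3, num_thresholds=256):
    """Best F-beta over evenly spaced thresholds in [0, 1].

    pred: per-pixel saliency scores in [0, 1]; gt: per-pixel 0/1 labels.
    beta_sq=0.3 follows the usual SOD convention (an assumption here).
    """
    total_pos = sum(1 for g in gt if g)
    best = 0.0
    for i in range(num_thresholds):
        thr = i / (num_thresholds - 1)
        tp = sum(1 for p, g in zip(pred, gt) if p >= thr and g)
        fp = sum(1 for p, g in zip(pred, gt) if p >= thr and not g)
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
        best = max(best, f)
    return best

# Toy example: four pixels, the first two are salient in the ground truth.
score = max_f_beta([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

Because beta² is below 1, precision is weighted more heavily than recall, so in the toy example the best score comes from the high threshold that keeps only the single confident true positive.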
