CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection
About
Most existing bi-modal (RGB-D and RGB-T) salient object detection methods rely on the convolution operation and construct complex interwoven fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation imposes a performance ceiling on convolution-based methods. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. In addition, because attention has quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components. Code and pretrained models will be available at https://github.com/lartpang/CAVER.
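The abstract does not spell out how the patch-wise token re-embedding works, so the following is only a minimal NumPy sketch of one plausible reading: each p×p patch of tokens is folded into a single, wider token by pure reshaping (no learned parameters), attention is computed among the patch tokens, and the result is unfolded back. The function names (`patch_reembed`, `patch_unembed`, `cross_modal_attention`), the absence of linear projections, and the use of the second modality as both key and value are assumptions made for illustration. This is not the authors' implementation, and it does not model the view-mixed attention itself.

```python
import numpy as np

def patch_reembed(x, h, w, p):
    """Fold each p x p patch of an (h*w, c) token grid into one token.

    Returns an array of shape (h*w / p**2, c * p**2). Pure reshaping,
    so no parameters are involved.
    """
    c = x.shape[-1]
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/p, w/p, p, p, c)
    return x.reshape((h // p) * (w // p), p * p * c)

def patch_unembed(x, h, w, p):
    """Inverse of patch_reembed: back to (h*w, c) tokens."""
    c = x.shape[-1] // (p * p)
    x = x.reshape(h // p, w // p, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/p, p, w/p, p, c)
    return x.reshape(h * w, c)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_tokens, kv_tokens, h, w, p):
    """Hypothetical cross-modal attention on re-embedded tokens.

    q_tokens come from one modality (e.g. RGB), kv_tokens from the
    other (e.g. depth or thermal). Learned projections are omitted.
    """
    q = patch_reembed(q_tokens, h, w, p)
    k = patch_reembed(kv_tokens, h, w, p)
    v = k  # assumption: key and value share the same source tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v
    return patch_unembed(out, h, w, p), scores.shape

# Toy input: an 8 x 8 feature map with 4 channels per modality.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((64, 4))
depth = rng.standard_normal((64, 4))
out, attn_shape = cross_modal_attention(rgb, depth, h=8, w=8, p=2)
```

With N = h·w tokens, plain attention builds an N×N score matrix; under this reading the matrix shrinks to (N/p²)×(N/p²), which is where the saving against quadratic complexity would come from.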
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| RGB-D Salient Object Detection | NLPR (test) | S-measure (Sα) | 92.9 | 71 |
| Salient Object Detection | VT5000 | S-Measure | 0.892 | 50 |
| RGB-D Salient Object Detection | STERE (test) | S-measure (Sα) | 0.914 | 45 |
| Salient Object Detection | VT821 | S-Measure | 0.891 | 43 |
| RGB-T Salient Object Detection | VT1000 | S-Measure (S) | 93.8 | 42 |
| RGB-T Salient Object Detection | VT821 | S Score | 0.898 | 42 |
| RGB-T Salient Object Detection | VT5000 (test) | Sm Score | 90 | 39 |
| RGB-T Salient Object Detection | VT1000 (test) | S-Measure | 93.8 | 39 |
| RGB-T Salient Object Detection | VT821 (test) | Sm | 0.898 | 39 |
| RGB-D Salient Object Detection | SIP (test) | S-measure (Sα) | 89.3 | 37 |