
SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

About

Convolutional neural networks (CNNs) are good at extracting contextual features within limited receptive fields, while transformers can model global long-range dependencies. By combining the strengths of transformers and CNNs, the Swin Transformer shows strong feature-representation ability. Building on it, we propose SwinNet, a cross-modality fusion model for RGB-D and RGB-T salient object detection. It is driven by a Swin Transformer to extract hierarchical features, boosted by an attention mechanism to bridge the gap between the two modalities, and guided by edge information to sharpen the contour of the salient object. Specifically, a two-stream Swin Transformer encoder first extracts multi-modality features, and then a spatial alignment and channel re-calibration module is presented to optimize intra-level cross-modality features. To clarify fuzzy boundaries, an edge-guided decoder achieves inter-level cross-modality fusion under the guidance of edge features. The proposed model outperforms state-of-the-art models on RGB-D and RGB-T datasets, showing that it provides more insight into the cross-modality complementarity task.
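The channel re-calibration idea described above can be illustrated with a minimal sketch: each modality's feature map is squeezed by global average pooling into per-channel gating weights, each channel is rescaled by its weight, and the two re-calibrated streams are fused element-wise. This is an illustrative toy in pure Python, not the paper's exact module; the function names and the additive fusion are assumptions for demonstration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_recalibrate(feat):
    # feat: list of channels, each a 2-D list (H x W) of floats.
    # Squeeze: global average pool per channel -> gating weight in (0, 1),
    # then rescale every value in that channel by its weight.
    out = []
    for ch in feat:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        w = sigmoid(mean)
        out.append([[v * w for v in row] for row in ch])
    return out

def fuse_modalities(a, b):
    # Re-calibrate each modality independently, then fuse element-wise.
    # (Additive fusion is an assumption; the paper's module is more involved.)
    ra, rb = channel_recalibrate(a), channel_recalibrate(b)
    return [[[x + y for x, y in zip(r1, r2)]
             for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(ra, rb)]

rgb   = [[[1.0, 0.0], [0.0, 1.0]]]   # 1 channel, 2x2 toy feature map
depth = [[[0.5, 0.5], [0.5, 0.5]]]
fused = fuse_modalities(rgb, depth)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 1 2 2
```

In the full model this kind of gating is computed per encoder level, so each modality can suppress channels that are unreliable (e.g. noisy depth regions) before inter-level fusion in the decoder.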

Zhengyi Liu, Yacheng Tan, Qian He, Yun Xiao • 2022

Related benchmarks

Task                            | Dataset   | Metric          | Result | Rank
------------------------------- | --------- | --------------- | ------ | ----
RGB-D Salient Object Detection  | STERE     | S-measure (Sα)  | 0.919  | 198
RGB-D Salient Object Detection  | SIP       | S-measure (Sα)  | 0.911  | 124
RGB-D Saliency Detection        | NLPR      | Max F-beta      | 0.936  | 65
RGB-D Salient Object Detection  | NJUD      | S-measure       | 92     | 54
Salient Object Detection        | VT5000    | S-measure       | 0.912  | 50
Salient Object Detection        | VT821     | S-measure       | 0.904  | 36
Salient Object Detection        | VT1000    | F-measure (Fm)  | 0.947  | 19
Salient Object Detection        | UVT 2000  | F-measure (Fm)  | 57.9   | 18
Salient Object Detection        | un-VT5000 | F-measure (Fm)  | 82.3   | 18
Salient Object Detection        | un-VT1000 | F-measure (Fm)  | 89     | 18

Showing 10 of 19 rows
