UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking

About

Multi-modal object tracking has attracted considerable attention because it integrates multiple complementary inputs (e.g., thermal, depth, and event data) to improve tracking performance. Although current general-purpose multi-modal trackers unify the various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) primarily through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state space model, termed UBATrack. UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, namely the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
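To make the adapter-tuning idea above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scalar linear state-space recurrence (h_t = a*h_{t-1} + b*x_t, y_t = c*h_t) as a stand-in for a Mamba-style block, and treats tokens as plain floats. The names `ssm_scan`, `adapter_forward`, and the parameters `a`, `b`, `c`, `scale` are hypothetical and chosen only for illustration.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear state-space recurrence (illustrative assumption):
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t, with h_0 = 0."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state carries information from earlier tokens
        ys.append(c * h)
    return ys


def adapter_forward(
    rgb_tokens: List[float],
    aux_tokens: List[float],
    a: float = 0.5,
    b: float = 1.0,
    c: float = 1.0,
    scale: float = 0.1,
) -> List[float]:
    """Hypothetical residual adapter: tokens from two modalities are
    concatenated into one sequence, scanned by the recurrence, and the
    scaled output is added back to the (frozen) input features."""
    seq = rgb_tokens + aux_tokens  # one joint sequence across modalities
    ys = ssm_scan(seq, a, b, c)
    return [x + scale * y for x, y in zip(seq, ys)]


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
    print(adapter_forward([1.0], [0.0]))
```

In this sketch only the adapter parameters (`a`, `b`, `c`, `scale`) would be trained while the input features stay fixed, which is the general sense in which adapter tuning avoids full-parameter fine-tuning. Setting `scale` to zero returns the input features unchanged.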

Qihua Liang, Liang Chen, Yaozong Zheng, Jian Nong, Zhiyi Mo, Bineng Zhong • 2026

Related benchmarks

Task	Dataset	Result
RGB-D Object Tracking	VOT-RGBD 2022 (public challenge)	EAO 77.8	263
RGB-T Tracking	LasHeR (test)	PR 76	257
RGB-D Object Tracking	DepthTrack (test)	Precision 67.7	181
RGB-E Tracking	VisEvent	MPR 79.7	46
RGB-T Tracking	RGBT210 (test)	--	32
RGB-T Tracking	RGBT234 17 (test)	Success Rate (MSR) 70.1	17
RGB-T Tracking	LasHeR 1.0 (test)	Success Rate 60.1	4

